Graphical data access method apparatus and system

ABSTRACT

Graphical memory access requests are routed to a plurality of bucket buffers. Filled bucket write buffers and empty bucket read buffers are efficiently emptied and filled, respectively, via a wide memory bus. The bucket sorting apparatus and method are used to increase the locality of memory references and pixel operations within a graphical rendering system. The increased locality increases graphical rendering performance and facilitates the use of smaller z-buffers, larger tiles, and low-cost dynamic RAM within a graphics pipeline.

RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 10/086,481 entitled “BUCKET-SORTING GRAPHICAL RENDERING APPARATUS AND METHOD” and filed on Feb. 28, 2002 for David B. Buehler.

BACKGROUND OF THE INVENTION

1. The Field of the Invention

The present invention relates generally to graphical rendering devices and systems. Specifically, the invention relates to devices and systems for conducting highly realistic three-dimensional graphical renderings.

2. The Relevant Art

Graphical rendering involves the conversion of one or more object descriptions to a set of pixels that are displayed on an output device such as a video display or image printer. Object descriptions are generally mathematical representations that model or represent the shape and surface characteristics of the displayed objects. Graphical object descriptions may be created by sampling real-world objects and/or by creating computer-generated objects using various editors.

In geometric terms, rendering requires representing or capturing the details of graphical objects from the viewer's perspective to create a two-dimensional scene or projection representing the viewer's perspective in three-dimensional space. The two-dimensional rendering facilitates viewing the scene on a display device such as a video monitor or printed page.

A primary objective of object modeling and graphical rendering is realism, i.e., a visually realistic representation that is life-like. Many factors impact realism, including surface detail, lighting effects, display resolution, display rate, and the like. Due to the complexity of real-world scenes, graphical rendering systems are known to have an insatiable thirst for processing power and data throughput. Currently available rendering systems lack the performance necessary to make photo-realistic renderings in real time.

To increase rendering quality and reduce storage requirements, surface details are often separated from the object shape and are mapped onto the surfaces of the object during rendering. The object descriptions, including surface details, are typically stored digitally within a computer memory or storage medium and referenced when needed.

One common method of representing three-dimensional objects involves combining simple graphical objects into a more realistic composite model or object. The simple graphical objects from which composite objects are built are often referred to as primitives. Examples of primitives include triangles, surface patches such as Bezier patches, and voxels.

Voxels are volume elements, typically cubic in shape, that represent a finite three-dimensional space, similar to bitmaps in two-dimensional space. Three-dimensional objects may be represented using a primitive comprising a three-dimensional array of voxels. A voxel object is created by assigning a color and a surface normal to certain voxel locations within the voxel array while marking other locations as transparent.

Voxel objects reduce the geometry bandwidth and processing requirements associated with rendering. For example, objects represented with voxels typically have smaller geometry transform requirements than similar objects constructed from triangles. Despite this advantage, existing voxel rendering algorithms are typically complex and extremely hardware-intensive. A fast algorithm for rendering voxel objects with low hardware requirements would reduce the geometry processing and geometry bandwidth requirements of rendering by allowing certain objects to be represented by voxel objects instead of many small triangles.

As mentioned, rendering involves creating a two-dimensional projection representing the viewer's perspective in a three-dimensional space. One common method of creating a two-dimensional projection involves performing a geometric transform on the primitives that comprise the various graphical objects within a scene. Performing a geometric transform changes any coordinates representing objects from an abstract space known as world space into actual device coordinates such as screen coordinates.
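The world-to-screen transform can be illustrated with a minimal Python sketch. It assumes a simplified camera located at `eye` that looks down the +z axis with no rotation, so the world-to-view step is only a translation; a practical pipeline would apply a full view matrix and clipping. The function name `world_to_screen` and its parameters are hypothetical.

```python
def world_to_screen(p, eye, focal_length, width, height):
    """Hypothetical helper: project a world-space point to pixel coordinates plus depth.

    Assumes a camera at `eye` looking down +z with no rotation and an image
    plane `focal_length` units in front of it.
    """
    # World space -> view space (translation only under the stated assumption).
    x, y, z = p[0] - eye[0], p[1] - eye[1], p[2] - eye[2]
    if z <= 0:
        return None  # behind the viewer; a real renderer would clip instead
    # Perspective divide: more distant points land closer to the screen centre.
    sx = focal_length * x / z
    sy = focal_length * y / z
    # Viewport mapping: image-plane units -> pixel coordinates (y grows downward).
    px = width / 2 + sx
    py = height / 2 - sy
    return px, py, z  # z is retained as the pixel depth for z-buffering
```

For example, the point (1, 1, 2) seen from the origin with a focal length of 100 on a 640 x 480 screen lands at pixel (370.0, 190.0) with depth 2.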

After a primitive such as a triangle has been transformed to a device coordinate system, a pixel is generated for each pixel location covered by that primitive. The process of converting graphical objects to pixels is sometimes referred to as rasterization or pixelization. Texture information may be accessed in conjunction with pixelization to determine the color of each of the pixels. Because more than one primitive may cover any given location, a z-depth is also calculated for each generated pixel and is used to determine which pixels are visible to the viewer.
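Pixelization with a per-pixel z-depth can be sketched using the common edge-function (half-space) formulation of triangle rasterization. This is one standard technique chosen for illustration, not necessarily the one used by any system described here; the names `edge` and `rasterize` are hypothetical, and each pixel is sampled at its centre.

```python
import math


def edge(ax, ay, bx, by, px, py):
    # Twice the signed area of triangle (a, b, p); its sign tells which side of ab p lies on.
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize(v0, v1, v2):
    """Yield (x, y, z) for every pixel centre covered by a screen-space triangle.

    Each vertex is (x, y, z); depth is interpolated with barycentric weights.
    """
    (x0, y0, z0), (x1, y1, z1), (x2, y2, z2) = v0, v1, v2
    area = edge(x0, y0, x1, y1, x2, y2)
    if area == 0:
        return  # degenerate triangle: covers no pixels
    xmin, xmax = math.floor(min(x0, x1, x2)), math.ceil(max(x0, x1, x2))
    ymin, ymax = math.floor(min(y0, y1, y2)), math.ceil(max(y0, y1, y2))
    for y in range(ymin, ymax):
        for x in range(xmin, xmax):
            cx, cy = x + 0.5, y + 0.5  # sample at the pixel centre
            # Dividing by the signed area makes the weights non-negative inside
            # the triangle for either vertex winding.
            w0 = edge(x1, y1, x2, y2, cx, cy) / area
            w1 = edge(x2, y2, x0, y0, cx, cy) / area
            w2 = edge(x0, y0, x1, y1, cx, cy) / area
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                yield x, y, w0 * z0 + w1 * z1 + w2 * z2
```

The same barycentric weights can interpolate any other per-vertex attribute, such as the object color.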

FIGS. 1 a and 1 b depict a simplified example of graphical rendering. Referring to FIG. 1 a, a graphical object 100 may be rendered by sampling attributes such as object color, texture, and reflectivity at discrete points on the object. The sampled points correspond to device-oriented regions, typically round or rectangular in shape, known as pixels 102. The distance between the sampled points is referred to herein as a sampling interval 104. The sampled attributes, along with surface orientation (i.e., a surface normal), are used to compute a rendered color 108 for each pixel 102. The rendered colors 108 of the pixels 102 preferably represent what a perspective viewer 106 would see from a particular distance and orientation relative to the graphical object 100.

As mentioned, the attributes collected by sampling the graphical object 100 are used to compute the rendered color 108 for each pixel 102. The rendered color 108 differs from the object color due to shading, lighting, and other effects that change what is seen from the perspective of the viewer 106. The rendered color 108 may also be constrained by the selected rendering device. The rendered color may be represented by a set of numbers 110 designating the intensity of each of the component colors of the selected rendering device, such as red, green, and blue on a video display or cyan, magenta, yellow, and black on an inkjet printer.

As the graphical object 100 is rendered with each frame, the positioning and spacing of the discrete sampling points (i.e., the pixels 102) projected onto the graphical object 100 determine what is seen by the perspective viewer 106. One method of rendering, referred to as ray tracing, involves determining the position of the discrete sampling points by extending a grid 111 of rays 112 from a focal point 114 to find the closest primitive each ray intersects. Since the rays 112 diverge, the spacing between the rays 112, and therefore the size of the grid 111, increases with increasing distance. Ray tracing, while precise and accurate, is generally not used in real-time rendering systems due to the computational complexity of currently available ray tracing algorithms.

The grid 111, depicted in FIG. 1 a, is a set of regularly spaced points corresponding to the pixels 102. The points of the grid 111 lie in an image plane perpendicular to a ray axis 115. The distance of each pixel 102 from a reference plane perpendicular to the ray axis 115, such as the grid 111, is known as the pixel depth or z-depth. The distance or depth of the graphical object 100 changes the level of detail seen by the perspective viewer 106. Relatively distant objects cover a smaller rendering area on the display device, resulting in a reduced number of rays 112 that reach the graphical object 100 and an increased sampling interval 104.

Visual artifacts occur when the spacing between the rays 112 results in a sampling interval 104 too large to faithfully capture the details of the graphical object 100. A number of methods have been developed to eliminate visual artifacts related to large sampling intervals. One method, known as super-sampling, involves rendering the scene at a higher resolution than the resolution used by the output device, followed by a smoothing or averaging operation to combine multiple rendered pixels into a single output pixel.
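The averaging step of super-sampling can be sketched as a simple box filter. This minimal Python sketch assumes a uniform factor x factor sampling grid per output pixel and equal sample weights; other reconstruction filters are also common. The name `downsample` is hypothetical.

```python
def downsample(image, factor):
    """Average each factor x factor block of a super-sampled image into one output pixel.

    `image` is a list of rows of (r, g, b) tuples whose dimensions are
    multiples of `factor`.
    """
    n = factor * factor
    out = []
    for oy in range(len(image) // factor):
        row = []
        for ox in range(len(image[0]) // factor):
            # Gather the super-samples that map onto output pixel (ox, oy).
            block = [image[oy * factor + dy][ox * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            # Equal-weight (box filter) average of each colour component.
            row.append(tuple(sum(c[i] for c in block) / n for i in range(3)))
        out.append(row)
    return out
```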

Another method, developed to represent objects faithfully at various distances and sampling intervals, involves creating multiple models of a given object. Less detailed models are used when an object is distant, while more detailed models are used when an object is close. Texture information may also be stored at multiple resolutions. During rendering, the texture map appropriate for the distance from the viewer is used.

The graphical objects, and portions thereof, that are visible to a viewer depend upon the perspective of the viewer. Referring to FIG. 1 b, a graphical scene 150 may include a variety of the graphical objects 100, some of which may be visible while others may be obstructed. Unobstructed objects are often designated as foreground objects 100 a, while partially obstructed objects may be referred to as background objects 100 b. Within the graphical scene 150, completely obstructed objects may be referred to as non-visible objects.

During rendering, the graphical scene 150 is converted to rendered pixels on a rendering device for observation by an actual viewer. Each rendered pixel preferably contains the rendered color 108 such that the actual viewer's visual perception of each graphical object 100 is that of the perspective viewer 106.

Only a small percentage of the graphical objects 100 may be visible within a particular graphical scene. For example, the room shown within the graphical scene 150 may be one of many rooms within a database containing an entire virtual house. Rendering non-visible objects and pixels unnecessarily consumes resources such as processing cycles, memory bandwidth, memory storage, and function-specific circuitry. Since the relative relationship of graphical objects changes with differing perspectives, for example as the perspective viewer 106 walks through a virtual house, the ability to dynamically determine and prune non-visible objects and pixels improves rendering performance.

Ray casting is a method of determining visible objects and pixels within a graphical scene 150, as shown in FIG. 1 a. Ray casting is a form of ray tracing that advances (casts) one ray for each pixel within the graphical scene 150 from the perspective viewer 106. With each cast, one or more graphical objects are tested against each ray to see whether the ray has “collided” with the object, an extremely processing-intensive procedure.

Z-buffering is another method that is used to determine visible pixels. Pixels are generated from each potentially visible object and stored within a z-buffer. A z-buffer typically stores a depth value and a pixel color value at a memory location corresponding to each x, y position within the graphical scene 150. A pixel color value is overwritten with a new value only if the new pixel depth is less than the depth of the currently stored pixel.
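The depth test described above can be captured in a few lines. This is a minimal Python sketch of a conventional full-screen z-buffer in which smaller depth means nearer; the class name `ZBuffer` and its methods are hypothetical.

```python
import math


class ZBuffer:
    """Minimal z-buffer: one depth value and one color value per (x, y) location."""

    def __init__(self, width, height, background=(0, 0, 0)):
        self.width = width
        self.depth = [math.inf] * (width * height)  # every location starts "infinitely far"
        self.color = [background] * (width * height)

    def write(self, x, y, z, color):
        """Store the pixel only if it is nearer than the pixel already stored there."""
        i = y * self.width + x
        if z < self.depth[i]:
            self.depth[i] = z
            self.color[i] = color
            return True
        return False  # the new pixel is hidden behind the stored one
```

Note that every incoming pixel costs a read of the stored depth and possibly a write, which is why z-buffer bandwidth grows with depth complexity.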

Referring to FIG. 2, a method of rendering known as post z-buffer shading and texturing defers shading and texturing operations within a rendering pipeline 200 and therefore does not texture or shade non-visible pixels. In a typical rendering system, the color of the pixels is calculated prior to z-buffering. In a post z-buffer shading and texturing system, such as the rendering pipeline 200, final color calculations are not performed until after the z-buffering operation. Deferred shading and texturing eliminates the memory lookups and processing operations associated with shading and texturing non-visible pixels and thereby facilitates increased system efficiency.
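The benefit of deferring color calculations can be shown with a minimal Python sketch: visibility is resolved first using depth only, and the (possibly expensive) shading function runs once per visible pixel. The names `render_deferred` and `shade`, and the fragment tuple layout, are hypothetical.

```python
def render_deferred(fragments, shade):
    """Post z-buffer shading: resolve visibility first, then shade only the survivors.

    `fragments` are (x, y, z, attrs) tuples; `shade` is any color function of attrs.
    """
    nearest = {}  # (x, y) -> (z, attrs) of the nearest fragment seen so far
    for x, y, z, attrs in fragments:
        key = (x, y)
        if key not in nearest or z < nearest[key][0]:
            # Keep the shading attributes but defer the color computation.
            nearest[key] = (z, attrs)
    # Shading and texturing run once per visible pixel, never on occluded fragments.
    return {key: shade(attrs) for key, (z, attrs) in nearest.items()}
```

Because `nearest` must carry the shading attributes alongside the depth, this sketch also shows why a post z-buffer design needs a larger z-buffer entry.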

The rendering pipeline 200 includes a display memory 210 and a graphics engine 220 comprising a triangle converter 230, a z-buffer 240, and a shading and texturing engine 250. The rendering pipeline 200 also includes a frame buffer 260. In the depicted embodiment, the display memory 210 receives and provides various object descriptors 212 that describe the graphical objects 100.

The display memory 210 preferably contains descriptions of those objects that are potentially visible in the graphical scene 150. As the scene changes, object descriptors 212 may be added to or removed from the display memory 210. In some embodiments, the display memory 210 contains a database of the object descriptors 212, for example, a database describing an entire virtual house.

Some amount of simple pruning may be conducted on objects within the display memory 210, for example, by software running on a host processor. Simple pruning omits from the rendering process those graphical objects that are easily identified as non-visible. For example, graphical objects 100 that are completely behind the perspective viewer 106 may be omitted or removed from the display memory 210.

The graphics engine 220 retrieves the object descriptors 212 from the display memory 210 and presents them to the triangle converter 230. In the depicted embodiment, the object descriptors 212 define the vertices of a triangle or set of triangles and their associated attributes, such as the object color. Typically, these attributes are interpolated across the face of the triangle to provide a set of potentially visible pixels 232.

The potentially visible pixels 232 are received by the z-buffer 240 and processed in the manner previously described to provide visible pixels 242 to the shading and texturing engine 250. The shading and texturing engine 250 textures and/or shades the visible pixels 242 to provide rendered pixels 252, which are collected by the frame buffer 260 to provide one frame of pixels 262. The framed pixels 262 are typically sent to a display system for viewing.

One difficulty in conducting post z-buffer shading and texturing is the increased complexity required of the z-buffer. In addition to the pixel depth, the z-buffer must hold additional information relevant to shading and texturing. The z-buffer is often a performance-critical element, in that each pixel is potentially updated multiple times, requiring increased bandwidth. The increased size and bandwidth requirements of the z-buffer have limited the use of post z-buffer shading and texturing within graphical systems.

One prior art method to reduce the size of the z-buffer is shown in FIG. 3. The method divides a screen 300 into tiles 310. The tiles 310 and the screen 300 consist of a plurality of scanlines 320. Each tile 310 is rendered as if it were the entire screen 300, thus requiring only a tile-sized z-buffer. While a tile-sized z-buffer requires less memory, it increases the complexity of sorting, storing, accessing, and rendering the object descriptors 212 within the display memory 210. The increased complexity results from objects that overlap more than one tile.
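The overlap problem that tiling introduces can be illustrated by binning primitives into tiles. This minimal Python sketch assigns each triangle to every tile its screen-space bounding box touches, which is a conservative simplification (exact coverage tests would assign fewer tiles). It assumes non-negative screen coordinates; the name `bin_triangles` is hypothetical.

```python
def bin_triangles(triangles, tile_size):
    """Assign each screen-space triangle to every tile its bounding box touches.

    `triangles` is a list of three (x, y) vertices each, with non-negative
    coordinates. Returns {(tile_x, tile_y): [triangle indices]}.
    """
    bins = {}
    for i, tri in enumerate(triangles):
        xs = [v[0] for v in tri]
        ys = [v[1] for v in tri]
        # Range of tile columns and rows spanned by the bounding box.
        for ty in range(int(min(ys)) // tile_size, int(max(ys)) // tile_size + 1):
            for tx in range(int(min(xs)) // tile_size, int(max(xs)) // tile_size + 1):
                bins.setdefault((tx, ty), []).append(i)
    return bins
```

With 16-pixel tiles, a triangle spanning (10, 10) to (40, 40) must be stored in and fetched for nine tiles; with 64-pixel tiles it falls in one, which is why larger tiles reduce the overlap burden.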

While many advances have been made to graphical rendering algorithms and architectures, including those depicted in the rendering pipeline 200, real-time rendering of photo-realistic, life-like scenes requires the ability to render greater geometric detail than is sustainable on currently available graphical rendering systems.

Therefore, what is generally needed are methods and apparatus to conduct efficient graphical rendering. Specifically, what is needed is a graphical system that renders voxel primitives efficiently. The ability to render voxel objects efficiently increases the detail achievable in real-time graphical rendering systems.

What is also needed is a graphical system that renders very detailed scenes with extensive depth complexity without tying up external memory interfaces with z-buffer data traffic. A z-buffering apparatus and method that facilitates large tiles, supports a high pixel throughput, is compact enough to reside entirely on-chip, and reduces external memory bandwidth requirements would facilitate such a system.

In addition to better z-buffering, a method and apparatus are needed that reduce the bandwidth load on the z-buffer. Specifically, what is needed is a method and apparatus that reduces the generation of non-visible pixels prior to z-buffering.

In addition to more intelligent pixel generation, rendering highly realistic scenes requires accessing large amounts of texture and world description data. Specifically, what is needed is an apparatus and method to maximize the efficiency of internal and external memory accesses. Such a method and apparatus would preferably achieve increased realism by facilitating larger stores of texture data within low-cost external memories, while maintaining a high data throughput within the rendering pipeline.

Lastly, what is needed is a graphical processing architecture that facilitates combining the various elements of the present invention into an efficient rendering pipeline that is scalable in performance.

OBJECTS AND BRIEF SUMMARY OF THE INVENTION

The apparatus of the present invention has been developed in response to the present state of the art, and in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available graphical rendering systems and methods. Accordingly, it is an overall object of the present invention to provide an improved method and apparatus for graphical rendering that overcomes many or all of the above-discussed shortcomings in the art.

To achieve the foregoing objects, and in accordance with the invention as embodied and broadly described herein in the preferred embodiments, an apparatus and method for improved graphical rendering is described. The apparatus and method facilitate increased rendering realism by supporting greater geometric detail, efficient voxel rendering, larger amounts of usable texture data, higher pixel resolutions including super-sampled resolutions, increased frame rates, and the like.

In a first aspect of the invention, a method and apparatus for casting ray bundles is described that casts entire bundles of rays relatively large distances. As the rays and bundles approach a graphical object, the ray bundles are subdivided into smaller bundles and the casting distances are reduced. Each bundle advances in response to a single test that is conducted against a proximity mask corresponding to a particular proximity. Sharing a single proximity test among all the rays within a bundle greatly reduces the processing burden associated with ray tracing. Individual rays are generated when a ray bundle is within close proximity to the object being rendered. The method and apparatus for casting ray bundles efficiently calculates the first ray intersections with an object and is particularly useful for voxel objects.

In a second aspect of the invention, a method and apparatus for gated pixelization (i.e., selective pixel generation) is described that conducts z-buffering at a coarse depth resolution using minimum and maximum depths for a pixel set. In one embodiment, the method and apparatus for gated pixelization maximizes the utility of the reduced depth resolution by shifting the range of depths stored within the z-buffer in coordination with the depth of the primitives being processed. The method and apparatus for gated pixelization also reduces the bandwidth and storage burden on the z-buffer and increases the throughput of the pixel generators.

In a third aspect of the invention, a method and apparatus for z-buffering pixels is described that stores and sorts the pixels from an area of the screen, such as a tile, into relatively small regions, each of which is processed to determine the visible pixels in that region. The method and apparatus facilitates high-throughput z-buffering, efficient storage of pixel auxiliary data, and deferred pixel shading and texturing.

In a fourth aspect of the invention, an apparatus and method for sorting memory accesses related to graphical objects is described that increases the locality of memory references and thereby increases memory throughput. In the presently preferred embodiment, access requests for a region of the screen are sorted and stored according to address, then accessed page by page to minimize the number of page loads that occur. Minimizing page loads maximizes the utilization of the available bandwidth of graphical memory interfaces.
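Why page-ordered access helps can be shown by counting page loads. This minimal Python sketch assumes a simplified memory model with a single open page that stays open until a request to a different page arrives; real DRAM has multiple banks and more detailed timing. The names `page_loads` and `sort_by_page` are hypothetical.

```python
def page_loads(addresses, page_size):
    """Count page (row) loads under a single-open-page memory model."""
    loads, open_page = 0, None
    for addr in addresses:
        page = addr // page_size
        if page != open_page:
            loads += 1        # a different page must be opened before this access
            open_page = page
    return loads


def sort_by_page(addresses, page_size):
    # Stable sort on page number: requests to the same page become adjacent.
    return sorted(addresses, key=lambda a: a // page_size)
```

For example, six requests alternating between two 2048-byte pages cost six page loads in arrival order but only two once sorted by page.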

The various aspects of the invention are combined in a pipelined graphics engine designed as the core of a graphics subsystem. In the presently preferred embodiment, graphical rendering is tile-based, and the pipelined graphics engine is configured to efficiently conduct tile-based rendering.

The graphics engine includes a set of pixel generators that operate in conjunction with one or more occlusion detectors. The pixel generators include voxel ray tracers, which use the method and apparatus for casting ray bundles to greatly reduce the number of computations required to determine visible voxels. In the preferred embodiment, the voxel objects are stored and processed in a compressed format.

The voxel ray tracers generate pixels from voxel objects by calculating ray collisions for the voxel objects being rendered. Proximity masks are preferably generated prior to pixel generation. Each proximity mask indicates the voxel locations that are within a certain distance of a nontransparent voxel. The proximity masks are brought in from external memory and cached as needed during the rendering process. An address that references the color of the particular voxel impinged upon by each ray is also calculated and stored within a pixel descriptor.
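A proximity mask can be precomputed by dilating the set of nontransparent voxels. This minimal Python sketch assumes a Chebyshev (cube-shaped) distance and represents each mask as a set of voxel coordinates rather than a packed bit array; the name `proximity_mask` is hypothetical.

```python
from itertools import product


def proximity_mask(solid, dims, distance):
    """Return the voxel locations within `distance` (Chebyshev) of a solid voxel.

    `solid` is a set of (x, y, z) nontransparent voxels inside a grid of size `dims`.
    """
    nx, ny, nz = dims
    mask = set()
    offsets = range(-distance, distance + 1)
    for (x, y, z) in solid:
        # Mark the cube-shaped neighbourhood around each solid voxel.
        for dx, dy, dz in product(offsets, repeat=3):
            p = (x + dx, y + dy, z + dz)
            if 0 <= p[0] < nx and 0 <= p[1] < ny and 0 <= p[2] < nz:
                mask.add(p)
    return mask
```

A distance of 0 reproduces the solid voxels themselves, i.e., a collision mask; a set of masks for increasing distances (for example 1, 2, 4, 8) gives a caster one mask per casting distance.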

The voxel ray tracers conduct ray bundle casting to efficiently determine any first ray intersections with a particular voxel object. The voxel ray tracers are preferably configured to conduct perspective ray tracing, in which the rays diverge with each cast.

Ray tracing commences by initializing the direction of the rays in the voxel object's coordinate system, based on the voxel object's orientation in world space and the location of the viewer. The casting direction of each ray bundle is represented by a single directional vector. A bundle width and height corresponding to a screen region represent the bundle size. In the preferred embodiment, a top-level bundle may comprise 100 or more rays.

Each ray bundle is advanced by casting the bundle a selected casting distance in the direction specified by the directional vector. A proximity mask is selected for testing that preferably indicates a proximity to the object surface corresponding to the selected casting distance. The single test against the properly selected proximity mask ensures that none of the rays in a bundle could have intersected the object between the last test and the current test.

A positive proximity test indicates that at least one ray is within a certain distance of the object surface. In response to a positive proximity test, the ray bundle is preferably subdivided into smaller bundles that are individually advanced, tested, and subdivided until each bundle is an individual ray. The individual rays are also advanced and tested against a collision mask that indicates impingement of the ray on a non-transparent voxel of the object of interest. Upon impingement, a color lookup address for the impinged voxel is calculated and stored, along with the x and y coordinates, in the pixel descriptor.
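The advance, test, and subdivide cycle described in the two paragraphs above can be sketched in Python under strong simplifications: a 2D grid, parallel rays (one per row) cast along +x rather than diverging perspective rays, Chebyshev-distance proximity masks stored as coordinate sets, and the smallest mask wide enough for the bundle chosen at each step. The function names `build_masks` and `cast_bundle` are hypothetical, and this is an illustration of the idea, not the described hardware.

```python
def build_masks(solid, width, height, distances=(1, 2, 4, 8)):
    """Proximity masks: cells within Chebyshev distance d of any solid cell, per d."""
    return {d: {(x + dx, y + dy)
                for (x, y) in solid
                for dx in range(-d, d + 1)
                for dy in range(-d, d + 1)
                if 0 <= x + dx < width and 0 <= y + dy < height}
            for d in distances}


def cast_bundle(y0, y1, x, solid, masks, width, hits, stats):
    """Cast parallel rays for rows y0..y1 (inclusive) along +x.

    Invariant: every column <= x is already known to be empty for every ray
    in the bundle. First hits are recorded in `hits` as {row: column}.
    """
    while x < width - 1:
        if y0 == y1:
            # Individual ray: step one cell and test the collision mask (the solid set).
            x += 1
            stats["collision_tests"] += 1
            if (x, y0) in solid:
                hits[y0] = x
                return
            continue
        yc = (y0 + y1) // 2           # the one position that is tested
        r = max(yc - y0, y1 - yc)     # bundle half-extent around yc
        usable = sorted(d for d in masks if d >= r)
        if usable:
            d = usable[0]             # wider bundles use coarser masks and longer casts
            tx = min(x + d, width - 1)
            stats["proximity_tests"] += 1  # a single test shared by the whole bundle
            if (tx, yc) not in masks[d]:
                # No solid cell within d of (tx, yc): columns x+1..tx are empty
                # for rows yc-d..yc+d, which covers every ray in the bundle.
                x = tx
                continue
        # Positive test (or no mask wide enough): subdivide and continue each half.
        cast_bundle(y0, yc, x, solid, masks, width, hits, stats)
        cast_bundle(yc + 1, y1, x, solid, masks, width, hits, stats)
        return


# Usage: a 16-ray bundle approaching a wall segment at column 20, rows 4..11.
width, height = 32, 16
solid = {(20, y) for y in range(4, 12)}
masks = build_masks(solid, width, height)
hits, stats = {}, {"proximity_tests": 0, "collision_tests": 0}
cast_bundle(0, height - 1, -1, solid, masks, width, hits, stats)
```

Each ray in rows 4 through 11 reports its first hit at column 20, and the total number of tests is far below the per-ray, per-cell stepping a naive caster would perform, because the early casts are shared by the whole bundle.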

The method and apparatus for casting ray bundles has several advantages and is particularly useful for voxel objects. Casting is very efficient, in that the majority of tests performed for each ray that intersects the surface are shared with the other rays in every bundle of which the ray was a member. The proximity mask information is compact, particularly when compressed, and may be cached on-chip for increased efficiency. The algorithm is also memory friendly, in that only those portions of the object that are potentially visible need be brought onto the chip; i.e., efficiency is maintained with partial view rendering. Perhaps the greatest advantage, particularly when conducted in conjunction with voxel objects, is a substantial reduction in the number of, and the bandwidth required for, geometry calculations within highly detailed scenes. The recursive subdividing nature of the algorithm also facilitates parallel execution, which in certain embodiments facilitates computing multiple ray intersections per compute cycle.

The pixel generators, such as the voxel ray tracers, generate potentially visible pixels, working in conjunction with the occlusion detector. The occlusion detector conducts depth checking at a coarse depth resolution in order to gate the pixel generators, thereby allowing the pixel generators to skip generating pixels for locations known to be occluded by a previously processed pixel. The preferred embodiment of the occlusion detector performs a parallel comparison of all the depth values within a region to a given value and returns a mask indicating the pixel locations that are occluded at that depth. The pixel generators use the mask information to generate only pixels that are not known to be occluded. Using the occlusion detectors to conduct pixel gating reduces the overall processing and storage burden on the z-buffer.
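The region-wide comparison and the resulting gate can be sketched with bitmasks. This minimal Python sketch assumes one coarse depth per pixel location in row-major order, with the comparison value taken as the nearest depth of the primitive being pixelized, so a set bit means anything that primitive could draw there is hidden. The names `occlusion_mask` and `pixels_to_generate` are hypothetical; hardware would perform the comparisons in parallel rather than in a loop.

```python
def occlusion_mask(region_depths, depth):
    """Return a bitmask with bit i set where stored coarse depth i is nearer than `depth`.

    `region_depths` holds one coarse depth per pixel location in a region (row-major).
    """
    mask = 0
    for i, stored in enumerate(region_depths):
        if stored < depth:  # something already drawn lies in front of `depth`
            mask |= 1 << i
    return mask


def pixels_to_generate(covered, occluded):
    # Gate the pixel generator: only locations that are covered and not known occluded.
    return covered & ~occluded
```

For a four-pixel region with coarse depths [3, 10, 5, 15] and a primitive whose nearest depth is 6, locations 0 and 2 are masked off, so a primitive covering all four locations generates pixels only at locations 1 and 3.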

In the preferred embodiment, the occlusion detector is used in conjunction with front-to-back rendering of the graphical primitives that comprise a scene. In certain embodiments, the occlusion detector is capable of shifting the depth range in which occlusions are detected. Depth shifting focuses the available resolution of the occlusion detector on a limited depth range. Depth shifting is preferably conducted in conjunction with depth-ordered rendering. Information from the occlusion detector may also be used to gate the processing of geometric primitives.
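The effect of depth shifting on a coarse depth code can be sketched as follows. This minimal Python sketch assumes 16 coarse levels spread uniformly over a window [z_base, z_base + z_range), with floor rounding and clamping at both ends; under strict less-than comparisons, floor rounding keeps occlusion decisions conservative. The name `quantize` is hypothetical.

```python
def quantize(z, z_base, z_range, levels=16):
    """Map depth z to a coarse code within the window [z_base, z_base + z_range).

    Depths nearer than the window clamp to 0; farther ones clamp to levels - 1.
    """
    code = int((z - z_base) * levels / z_range)  # floor for depths inside the window
    return max(0, min(levels - 1, code))
```

With a fixed window covering depths 0 to 1000, depths 101 and 110 both quantize to code 1 and cannot be distinguished. Shifting the window to start at 100 with a range of 16 gives them codes 1 and 10, so the same 16 levels resolve the depth range currently being rendered.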

The pixel generators and the occlusion detectors coordinate to conduct gated pixelization and provide potentially visible pixels to a sorting z-buffer. The sorting z-buffer includes a region sorter, a region memory, and a region-sized z-buffer. The region sorter sorts the potentially visible pixels according to their x, y coordinates within a screen or tile to provide sorted pixels. The sorted pixels corresponding to each region within a graphical scene or tile are received and processed by the region-sized z-buffer to provide the visible pixels.
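The sorting z-buffer can be sketched as two steps: bin pixels by region, then resolve visibility one region at a time with a z-buffer only as large as a region. This minimal Python sketch assumes square regions and a (x, y, z, payload) pixel layout; the name `sorting_zbuffer` is hypothetical.

```python
def sorting_zbuffer(pixels, region_size):
    """Sort pixels of a tile into square regions, then z-buffer one region at a time.

    `pixels` are (x, y, z, payload) tuples in arbitrary order.
    """
    regions = {}
    for p in pixels:  # region sorter
        regions.setdefault((p[0] // region_size, p[1] // region_size), []).append(p)

    visible = []
    for _, bucket in sorted(regions.items()):  # process one region at a time
        zbuf = [None] * (region_size * region_size)  # the small, region-sized z-buffer
        for x, y, z, payload in bucket:
            i = (y % region_size) * region_size + (x % region_size)
            if zbuf[i] is None or z < zbuf[i][2]:
                zbuf[i] = (x, y, z, payload)
        visible.extend(p for p in zbuf if p is not None)
    return visible
```

Only one region_size x region_size array is live at a time, regardless of tile size, and the pixels may arrive in any order.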

In the preferred embodiment, the region sorter is a hardware bucket sorter. The bucket sorter operates by storing the pixels as they arrive in temporary buffers, which are transferred in parallel into the region memory when full. Additional stages of bucket sorting may be conducted by sorting pixels stored within the region memory.
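The temporary-buffer behavior of the bucket sorter can be modeled in software. In this minimal Python sketch, each bucket has a small buffer; when a buffer reaches `capacity` entries, its contents are moved to region memory as one burst, standing in for a single wide parallel transfer. The class name `BucketSorter`, its methods, and the `bursts` counter are hypothetical.

```python
class BucketSorter:
    """First-level bucket sort with small per-bucket buffers flushed to region memory."""

    def __init__(self, num_buckets, capacity, bucket_of):
        self.capacity = capacity
        self.bucket_of = bucket_of  # maps a pixel to its bucket index
        self.buffers = [[] for _ in range(num_buckets)]
        self.region_memory = [[] for _ in range(num_buckets)]
        self.bursts = 0  # number of buffer-to-memory transfers performed

    def push(self, pixel):
        b = self.bucket_of(pixel)
        self.buffers[b].append(pixel)
        if len(self.buffers[b]) == self.capacity:
            self._flush(b)  # buffer full: transfer it as one burst

    def _flush(self, b):
        if self.buffers[b]:
            self.region_memory[b].extend(self.buffers[b])  # one wide write
            self.buffers[b] = []
            self.bursts += 1

    def drain(self):
        # End of tile: flush partially filled buffers too.
        for b in range(len(self.buffers)):
            self._flush(b)
        return self.region_memory
```

Because pixels from one primitive tend to fall in the same bucket, buffers usually fill completely before being flushed, so most transfers use the full width of the memory bus.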

Sorting the pixels into regions facilitates the use of a very small z-buffer at the core of the sorting z-buffer. The screen regions corresponding to the region-sized z-buffer are preferably smaller than the tiles typical of rendering systems. Sorting the pixels into regions also facilitates the use of larger tiles. Larger tiles reduce the number of graphic primitives that overlap more than one tile.

In one embodiment, using a region-sized z-buffer within the sorting z-buffer facilitates rendering without tiling. Using a region-sized z-buffer has the additional advantages of facilitating dynamic adjustment of the tile size and of handling more than one pixel in the z-buffer for a given location within the region, a useful feature for processing semi-transparent pixels. Using a region-sized z-buffer also facilitates handling a large number of pixels per cycle. The pixels may be randomly placed within a tile and need not be stored or accessed in any particular order.

In the preferred embodiment, the bucket sorter stores the received pixels by conducting a parallel transfer to the region memory. Since the pixels may originate from the same primitive, the received pixels often have a certain amount of spatial coherence. In the preferred embodiment, the bucket sorter exploits spatial coherence by conducting a first level of bucket sorting as the pixels arrive. Additional levels of bucket sorting may be performed by recursively processing the contents of the region memory.

A further stage of the sorting z-buffer is the pixel combiner. The pixel combiner monitors the pixels provided by the sorting z-buffer. In those instances where super-sampled anti-aliasing is performed, combining is conducted on those pixels that can be combined without loss of visual quality. Combining is preferred for super-sampled pixels that reference the same texture. Combining reduces the load on the colorization engine and the anti-aliasing filter.
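One way a pixel combiner might merge super-samples is sketched below in Python. It assumes that samples falling in the same output pixel and referencing the same texture can be merged into a single colorization request carrying a weight equal to the number of merged samples; averaging their texture coordinates is an illustrative choice, not a detail taken from the text. The name `combine_samples` and the sample layout are hypothetical.

```python
def combine_samples(samples, factor):
    """Merge super-samples in the same output pixel that reference the same texture.

    Each sample is (sx, sy, texture_id, u, v) at super-sampled resolution.
    Returns (ox, oy, texture_id, u, v, weight) entries, one per colorization request.
    """
    groups = {}
    for sx, sy, tex, u, v in samples:
        key = (sx // factor, sy // factor, tex)  # output pixel plus texture
        groups.setdefault(key, []).append((u, v))
    combined = []
    for (ox, oy, tex), uvs in groups.items():
        n = len(uvs)
        # Averaging (u, v) is an illustrative assumption; the weight n lets the
        # anti-aliasing filter treat the merged entry as n samples.
        combined.append((ox, oy, tex,
                         sum(u for u, _ in uvs) / n,
                         sum(v for _, v in uvs) / n,
                         n))
    return combined
```

Four samples in one output pixel, three of which share a texture, produce two colorization requests instead of four, and the weights still sum to four for the filter.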

The sorting z-buffer provides visible pixels to a colorization engine. The colorization engine colorizes the pixels to provide colorized pixels. In the present invention, colorizing may comprise any operation that affects the rendered color of a pixel. In one embodiment, the colorizing of pixels includes shading, texturing, normal perturbation (i.e., bump mapping), and environmental reflectance mapping. Colorizing only those pixels that are visible reduces the processing load on the colorization engine and reduces the bandwidth demands on external texture memory.

The colorization engine colorizes pixels using a set of pixel colorizers, an attribute request sorter, and a set of attribute request queues. The graphics engine may also include, or be connected to, a pixel attribute memory containing pixel attributes that are accessed by the pixel colorizers in conjunction with colorization. Voxel color data is preferably stored in a packed array so that only nontransparent voxels on the surface of an object need be stored. Surface normal information is also stored along with the color.
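A packed voxel color array can be sketched with an occupancy bitmask. In this minimal Python sketch, each voxel location has one occupancy bit (row-major order), only occupied voxels receive a color slot, and a voxel's color lookup address is the number of occupied voxels that precede it (a rank, or population-count, operation). This addressing scheme is an assumption for illustration; the class name `PackedVoxelColors` and its methods are hypothetical.

```python
class PackedVoxelColors:
    """Packed color storage: only nontransparent voxels occupy color slots."""

    def __init__(self, dims, voxels):
        self.nx, self.ny, self.nz = dims
        self.bits = 0  # one occupancy bit per voxel location
        entries = {}
        for (x, y, z), (color, normal) in voxels.items():
            i = self._index(x, y, z)
            self.bits |= 1 << i
            entries[i] = (color, normal)
        # Colors (with surface normals) are stored densely in location order.
        self.colors = [entries[i] for i in sorted(entries)]

    def _index(self, x, y, z):
        return (z * self.ny + y) * self.nx + x

    def address(self, x, y, z):
        """Color lookup address, or None for a transparent location."""
        i = self._index(x, y, z)
        if not (self.bits >> i) & 1:
            return None
        return bin(self.bits & ((1 << i) - 1)).count("1")  # set bits below i

    def lookup(self, x, y, z):
        a = self.address(x, y, z)
        return None if a is None else self.colors[a]
```

A 4 x 4 x 4 object with only two nontransparent voxels stores just two color entries, while transparent locations cost one bit each.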

The attribute request sorter routes and directs the attribute requests relevant to pixel colorization to the various attribute request queues. In one embodiment, the attribute request sorter sorts the attribute requests according to the memory page in which the requested attribute is stored and routes the sorted requests to the pixel attribute memory.

Sorting the attribute requests increases performance and/or facilitates the use of lower-cost storage by increasing the locality of memory references. In one embodiment, increasing the locality of memory references facilitates using greater quantities of slower, less costly dynamic random access memory (DRAM) within a memory subsystem while maintaining equivalent data throughput.

In the preferred embodiment, the last stage in the pipeline is the anti-aliasing filter. In those instances where super-sampling is performed, multiple super-sampled pixels are combined to provide rendered pixels. The rendered pixels are stored in the frame buffer and used to provide a high-quality graphical rendering.

The various elements of the graphics engine work together to accomplish high-performance, highly detailed rendering using reduced system resources. Pixel descriptors are judiciously generated in the pixelizers by conducting gated pixelization. Each pixel descriptor, though grouped with other pixels of the same screen region, flows independently through the various pipeline stages. Within each pipeline stage, the number of processing units operating in parallel is preferably scalable, in that each pixel is directed to an available processing unit.

BRIEF DESCRIPTION OF THE DRAWINGS

In order that the manner in which the advantages and objects of the invention are obtained will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof, which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 a is a partially schematic perspective view depicting a prior art method of rendering a graphical object;

FIG. 1 b is a perspective view of a graphical scene in accordance with graphical rendering systems;

FIG. 2 is a schematic block diagram depicting a prior art graphics pipeline;

FIG. 3 is a chart depicting a prior art tile-based rendering method;

FIG. 4 a is a schematic block diagram depicting one embodiment of a graphical rendering system in accordance with the invention;

FIG. 4 b is a schematic block diagram depicting one embodiment of a graphics subsystem in accordance with the present invention;

FIG. 5 is a schematic block diagram depicting one embodiment of a graphical rendering apparatus of the present invention;

FIG. 6 is a schematic block diagram depicting one embodiment of a graphical rendering method of the present invention;

FIG. 7 is a schematic block diagram depicting one embodiment of a pixel generation apparatus of the present invention;

FIG. 8 a is a schematic block diagram depicting one embodiment of a triangle pixelization apparatus of the present invention;

FIG. 8 b is a flow chart diagram depicting one embodiment of a triangle pixelization method of the present invention;

FIG. 8 c is an illustration depicting the results of one embodiment of the triangle pixelization method of the present invention;

FIG. 9 is a schematic block diagram depicting one embodiment of a ray tracing apparatus of the present invention;

FIG. 10 a is a schematic block diagram depicting one embodiment of a proximity testing apparatus of the present invention;

FIG. 10 b is a schematic block diagram depicting one embodiment of a collision testing apparatus of the present invention;

FIG. 11 is a schematic block diagram depicting one embodiment of a casting apparatus of the present invention;

FIG. 12 is a schematic block diagram depicting one embodiment of a ray casting method of the present invention;

FIG. 13 a is a flow chart diagram depicting one embodiment of a proximity mask generation method in accordance with the present invention;

FIG. 13 b is a side view of an object being rendered;

FIGS. 13 c-g are illustrations of various stages in the mask generation process;

FIGS. 14, 15, and 16 are illustrations depicting the operation of various embodiments of the ray casting method of FIG. 12;

FIG. 17 a is a schematic block diagram depicting one embodiment of an occlusion detection apparatus of the present invention;

FIG. 17 b is a flow chart diagram depicting one embodiment of an occlusion detection method of the present invention;

FIG. 18 a is a schematic block diagram depicting one embodiment of a bucket sorting apparatus of the present invention;

FIG. 18 b is a schematic block diagram depicting an on-chip embodiment of a bucket sorting apparatus of the present invention;

FIG. 19 is a flow chart diagram depicting one embodiment of a bucket sorting method of the present invention;

FIG. 20 a is a schematic block diagram depicting one embodiment of a sorting z-buffer apparatus of the present invention;

FIG. 20 b is a flow chart diagram depicting one embodiment of a sorting z-buffer method of the present invention;

FIG. 21 a is a schematic block diagram depicting one embodiment of a graphics memory localization apparatus of the present invention;

FIG. 21 b is a flow chart diagram depicting one embodiment of a graphics memory localization method of the present invention;

FIG. 22 is a schematic block diagram depicting one embodiment of a pixel colorization apparatus of the present invention; and

FIG. 23 is a flow chart diagram depicting one embodiment of a pixel colorization method of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to FIG. 4 a, a digital media system 400 in accordance with the present invention may include a CPU 410, a storage device 420, a memory 430, an audio subsystem 440, and a graphics subsystem 450, interconnected by a system bus 412. In addition, the digital media system 400 may include speakers 445 and a video display 455. In the depicted embodiment, the speakers 445 receive and play an audio signal 442 from the audio subsystem 440, while the video display 455 receives and displays a video signal 452 from the graphics subsystem 450. The digital media system 400 may be a multimedia system such as a game console or personal computer.

Referring to FIG. 4 b, one embodiment of the graphics subsystem 450 in accordance with the present invention includes a transform engine 460, a display memory 470, a graphics engine 480, and a frame buffer 490. The transform engine 460 receives data such as the object descriptors 212 from the system bus 412. In the preferred embodiment, the transform engine 460 converts the coordinates associated with the object descriptors 212 into screen coordinates such as those seen by the perspective viewer 106. The display memory 470 stores the object descriptors 212 and provides them to the graphics engine 480.

The graphics engine 480 converts the object descriptors 212 to rendered pixels 482, while the frame buffer 490 and associated circuitry convert the rendered pixels 482 to the video signal 452. In one embodiment, the display memory 470 is substantially identical to the (prior art) display memory 210, and the frame buffer 490 is substantially identical to the (prior art) frame buffer 260.

FIG. 5 is a schematic block diagram depicting one embodiment of the graphics engine 480 of the present invention. The graphics engine 480 may be embodied in hardware, software, or a combination of the two. In the preferred embodiment, the graphics engine 480 is pipelined, operating on batches of pixels corresponding to a single tile. For example, the sorting z-buffer 530 may operate on objects or pixels corresponding to a first tile while the colorization engine 550 works on pixels corresponding to a second tile. When the colorization engine 550 has finished colorizing the pixels, the pixels are sorted into screen order and anti-aliased, generating rendered pixels.

In the depicted embodiment, the graphics engine 480 includes a set of pixel generators 510 that operate in conjunction with one or more occlusion detectors 520 to conduct gated pixelization. The pixel generators 510 receive the object descriptors 212 and provide potentially visible pixels 512 to a sorting z-buffer 530. The occlusion detectors 520 gate the pixelization conducted by the pixel generators by maintaining a current occlusion depth for each pixel position.

As shown in FIG. 4 b, the object descriptors 212 may be provided by the display memory 470. The object descriptors 212 describe graphical objects, such as the graphical object 100 of FIG. 1 a. Each object may be composed of multiple sub-objects or primitives such as triangles, Bezier patches, and voxel arrays. In the preferred embodiment, each sub-object corresponds to one object descriptor 212, resulting in multiple object descriptors 212 for those objects that are composed of multiple sub-objects.

Processing is preferably conducted on each object descriptor 212 independent of other object descriptors. For purposes of clarity, the description of this invention typically implies a single object descriptor 212 for each graphical object 100, though multiple object descriptors 212 are preferred for each graphical object 100.

The object descriptors 212 are typically stored within the display memory 470 as a collection of display lists. In the preferred embodiment, each display list corresponds to a tile. The descriptors for objects (or primitives) that overlap multiple tiles are placed in more than one display list, and each list is sorted in order of depth, so that the object descriptors 212 are sorted in tile and depth order. In one embodiment, display list sorting to provide tile and depth ordering is conducted by the transform engine 460. Tile and depth ordering is preferred to increase efficiency but is not required. Collectively, the object descriptors 212 describe a graphical scene such as the graphical scene 150.
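The tile binning and depth ordering of display lists can be illustrated with the following Python sketch. It is not the transform engine 460 itself: the function name `build_display_lists` is hypothetical, and the sketch assumes each object descriptor has already been reduced to an object ID, an inclusive integer screen-space bounding box, and a single representative depth.

```python
from collections import defaultdict

def build_display_lists(objects, tile_size):
    """Bin object descriptors into per-tile display lists sorted by depth.

    objects: iterable of (object_id, (xmin, ymin, xmax, ymax), depth), where the
    bounding box is inclusive and in integer pixel coordinates (an assumption).
    Returns {(tile_x, tile_y): [object_id, ...]} ordered front to back.
    """
    lists = defaultdict(list)
    for obj_id, (x0, y0, x1, y1), depth in objects:
        # An object overlapping several tiles is placed in every tile's list.
        for ty in range(y0 // tile_size, y1 // tile_size + 1):
            for tx in range(x0 // tile_size, x1 // tile_size + 1):
                lists[(tx, ty)].append((depth, obj_id))
    # Sort each list by depth so that nearer objects come first.
    return {tile: [obj_id for _, obj_id in sorted(entries)]
            for tile, entries in lists.items()}
```

For example, an object spanning x = 0 to 20 with 16-pixel tiles lands in the lists for tiles (0, 0) and (1, 0), and within tile (0, 0) a nearer object is listed ahead of it.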

Referring again to FIG. 5, the occlusion detector 520 receives a pixel set descriptor 514, including depth information, and provides a pixel set mask 522. In one embodiment, the pixel set descriptor describes a horizontal span of consecutive pixels. The pixel set mask 522 preferably comprises one bit per pixel location within the pixel set defined by the pixel set descriptor 514. The pixel set mask 522 indicates which pixels within the pixel set are potentially visible or, alternately, which pixel locations were previously rendered at a shallower depth and therefore need not be rendered.
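As a rough illustration of the pixel set mask 522, the sketch below assumes a pixel set descriptor reduced to a starting x position plus one depth per pixel of a horizontal span, and an occlusion detector that keeps one current occlusion depth per x position of the scanline, with smaller depth meaning nearer. The function name and data layout are hypothetical.

```python
def pixel_set_mask(x_start, depths, occlusion_depth):
    """Return an integer mask with one bit per pixel of a horizontal span.

    Bit i is set when pixel x_start + i lies in front of (has a smaller depth
    than) the current occlusion depth stored for that position, i.e. it is
    potentially visible. Cleared bits mark pixels already covered at a
    shallower depth.
    """
    mask = 0
    for i, z in enumerate(depths):
        if z < occlusion_depth[x_start + i]:
            mask |= 1 << i
    return mask

occlusion = [10.0, 10.0, 2.0, 2.0, 10.0]  # current occlusion depth per x
print(bin(pixel_set_mask(1, [5.0, 5.0, 5.0], occlusion)))  # prints 0b1
```

Only the first pixel of the span (x = 1) survives; the two pixels at x = 2 and x = 3 lie behind surfaces already rendered at depth 2.0.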

The pixel generators 510 coordinate with the occlusion detectors 520 to prune or gate pixels that are known to be occluded and in response provide the potentially visible pixels 512. Conducting gated pixelization via the occlusion detectors 520 reduces the processing and storage burden on the graphics engine 480, particularly the pixel generators 510, and reduces the required size of the sorting z-buffer 530.

The sorting z-buffer 530 receives the potentially visible pixels from the pixel generators 510. The sorting z-buffer 530 sorts the potentially visible pixels into regions to facilitate using a relatively small z-buffer referred to as a region-sized z-buffer 545. The sorted pixels are processed one region at a time by the region-sized z-buffer 545 to provide visible pixels 532. In certain embodiments, where pixel transparency is supported, multiple pixel descriptors for the same pixel location are provided to the colorization engine 550.

The colorization engine 550 colorizes the visible pixels 532 to provide colorized pixels 552. Colorizing the pixels may involve a wide variety of operations that affect the final rendered color of each pixel. In one embodiment, colorizing the pixels includes operations selected from texturing, shading, environmental reflectance mapping, and shadowing.

The colorized pixels 552 are filtered by an anti-aliasing filter 570 to provide the rendered pixels 482. The graphics engine 480 also includes a pixel attribute memory 580 containing information such as texture maps, color tables, and the like. The information within the pixel attribute memory 580 is used by the colorization engine 550 to conduct colorizing operations.

As depicted in FIG. 5, the sorting z-buffer 530 includes a region sorter 535, a region memory 540, and a region-sized z-buffer 545. The region sorter 535 receives the potentially visible pixels 512 and groups the pixels into regions based on their x, y coordinates within the graphical scene 150. In one embodiment, the region sorter 535 is a bucket sorter that uses selected high-order bits of the x and y coordinates as a sorting key to sort the potentially visible pixels 512.
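A minimal Python sketch of such a bucket key follows. It assumes square regions and square tiles whose sides are powers of two (2 to the region_bits and 2 to the tile_bits pixels) and in-tile x, y coordinates; `region_key`, `bucket_sort_pixels`, and the default sizes are illustrative assumptions, not values from this description.

```python
from collections import defaultdict

def region_key(x, y, region_bits, tile_bits):
    """Bucket key built from the high-order bits of the in-tile x and y.

    Regions are 2**region_bits pixels square inside a tile that is
    2**tile_bits pixels square (both sizes are assumptions).
    """
    regions_per_row = 1 << (tile_bits - region_bits)
    return (y >> region_bits) * regions_per_row + (x >> region_bits)

def bucket_sort_pixels(pixels, region_bits=3, tile_bits=6):
    """Distribute (x, y, z) pixels into per-region buckets in arrival order."""
    buckets = defaultdict(list)
    for pixel in pixels:
        x, y = pixel[0], pixel[1]
        buckets[region_key(x, y, region_bits, tile_bits)].append(pixel)
    return buckets
```

Because the key is just shifted coordinate bits, it can be computed in hardware with wiring alone; no comparisons between pixels are required, which is what makes a bucket sort attractive here.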

In the depicted embodiment, the potentially visible pixels 512 are distributed into the region memory 540 via a memory bus 542 to locations that correspond to specific regions within the graphical scene 150. In one embodiment, the region memory locations are dynamically allocated to specific regions and are accessed via a linked list. The sorted pixels 537 corresponding to a region within the graphical scene 150 are removed from the region memory 540 by the region sorter 535 and are processed by the region-sized z-buffer 545 to provide the visible pixels 532.
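The dynamically allocated, linked-list organization of the region memory 540 might be modeled as follows. This is a software sketch under assumptions: a fixed pool of equal-sized blocks, a free list, and per-region head and tail pointers. The class `RegionMemory` and its methods are hypothetical.

```python
class RegionMemory:
    """Hypothetical region memory: fixed-size blocks allocated on demand and
    chained per region into a singly linked list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.data = [[] for _ in range(num_blocks)]
        self.next = [-1] * num_blocks                   # -1 terminates a chain
        self.free = list(range(num_blocks - 1, -1, -1))  # stack of free blocks
        self.head = {}                                  # region -> first block
        self.tail = {}                                  # region -> last block

    def append(self, region, pixel):
        """Store a pixel in its region, linking in a new block when needed."""
        tail = self.tail.get(region)
        if tail is None or len(self.data[tail]) == self.block_size:
            blk = self.free.pop()  # raises IndexError if the pool is exhausted
            self.data[blk] = []
            self.next[blk] = -1
            if tail is None:
                self.head[region] = blk
            else:
                self.next[tail] = blk
            self.tail[region] = blk
            tail = blk
        self.data[tail].append(pixel)

    def drain(self, region):
        """Remove and return all pixels of a region, freeing its blocks."""
        out = []
        blk = self.head.pop(region, -1)
        self.tail.pop(region, None)
        while blk != -1:
            out.extend(self.data[blk])
            nxt = self.next[blk]
            self.free.append(blk)
            blk = nxt
        return out
```

Because blocks are taken from the pool only when a region actually receives pixels, sparsely covered regions consume little memory, and draining a region returns its blocks for reuse.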

Sorting the pixels into regions facilitates the use of a very small z-buffer. The screen regions corresponding to the region-sized z-buffer 545 are preferably smaller than, and aligned with, the tiles 310. In one embodiment, multiple-pass hyper-sorting is conducted such that each region is a single pixel and the region-sized z-buffer 545 is essentially a register.

Sorting the pixels into regions also facilitates the use of larger tiles within a rendering system. Larger tiles reduce the processing load on the graphics engine 480, as a greater fraction of the primitives comprising the graphical objects 100 are contained within a single graphical tile 310. In one embodiment, the tile 310 is equivalent to the screen 300.

The region-sized z-buffer 545 preferably stores a pixel for each x, y position within a region of the graphical scene 150. A stored pixel is overwritten only if the incoming pixel has a depth that is less than the depth of the currently stored pixel. After processing all of the sorted pixels 537 corresponding to a region, the pixels remaining within the region-sized z-buffer 545 are presented as the visible pixels 532.
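The depth comparison performed by a region-sized z-buffer can be sketched in Python as below, assuming opaque pixels, smaller z meaning nearer, and pixels that have already been sorted into a single square region. The function `zbuffer_region` and the tuple layout are hypothetical.

```python
import math

def zbuffer_region(sorted_pixels, region_x0, region_y0, size):
    """Resolve visibility for one square region of side `size`.

    sorted_pixels: iterable of (x, y, z, payload) tuples that all fall inside
    the region whose top-left corner is (region_x0, region_y0).
    A stored pixel is replaced only by a strictly nearer (smaller z) pixel.
    Returns the surviving (visible) pixels in row-major order.
    """
    depth = [[math.inf] * size for _ in range(size)]
    store = [[None] * size for _ in range(size)]
    for x, y, z, payload in sorted_pixels:
        lx, ly = x - region_x0, y - region_y0
        if z < depth[ly][lx]:
            depth[ly][lx] = z
            store[ly][lx] = (x, y, z, payload)
    return [p for row in store for p in row if p is not None]
```

Since the buffer covers only one region, its size is fixed by the region dimensions rather than the tile or screen, which is the point of sorting before z-buffering.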

The sorting z-buffer 530 facilitates the usage of complex pixel descriptors while using a relatively small local memory. Another benefit of the sorting z-buffer 530 is the ability to conduct deferred shading and texturing while significantly reducing external memory accesses. The sorting z-buffer 530 also minimizes the processing load on the rest of the graphics engine 480, particularly the colorization engine 550.

The colorization engine 550 depicted in FIG. 5 includes a set of pixel colorizers 555, an attribute request sorter 560, and a set of attribute request queues 565. The pixel colorizers 555 receive the visible pixels 532, including descriptive information used to colorize the pixels. The descriptive information is used to generate attribute requests 557 that are sent to the attribute request sorter 560.

The attribute request sorter 560 sorts and directs the attribute requests 557 to the attribute request queues 565. In one embodiment, the attribute request sorter 560 sorts the attribute requests 557 according to the memory page in which the requested attribute is stored. The attribute request sorter 560 also directs the sorted requests to provide one or more sorted attribute requests 562 to the pixel attribute memory 580. The pixel attribute memory 580 receives the sorted attribute requests 562 and provides one or more pixel attributes 582.

Sorting the attribute requests increases the effective bandwidth to external storage by increasing the locality of memory references. This facilitates the use of a larger amount of slower, lower-cost memory with the same effective bandwidth as faster memory, or greater texture storage bandwidth with the same memory technology. It also allows complex multiple-lookup texturing and shading algorithms to be conducted efficiently by repeatedly calculating the addresses of the next data items to be looked up and then looking them all up in batches between sorting steps.

The pixel attributes 582 are received by the pixel colorizers 555 and are used to colorize the visible pixels 532. Colorizing only visible pixels reduces the processing load on the graphics engine 480. In one embodiment, colorization comprises shading, texturing including surface normal perturbation, as well as bi-directional reflectance data lookup for shading.

The various mechanisms of the graphics engine 480 work together to accomplish high-performance rendering using reduced system resources. In certain embodiments, the reduced usage of resources facilitates the super-sampling of pixels, which is preferred when rendering voxel objects. Super-sampling involves rendering at a resolution that is too detailed to be displayed by the output device, followed by filtering and down-sampling to a lower-resolution image that is displayable by the output device.

For example, in one embodiment, super-sampling involves generating a 3×3 grid of super-sampled pixels for each pixel displayed. The 3×3 grid of super-sampled pixels is low-pass filtered and down-sampled by the anti-aliasing filter 570 to provide the rendered pixels 482. Super-sampling increases image quality but also significantly increases the processing and storage requirements of graphical systems.
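The 3×3 down-sampling step could look like the following sketch. The description says only that the grid is low-pass filtered; the uniform box filter used here (a plain average of the nine samples) is an assumption chosen for simplicity, and an anti-aliasing filter may instead use weighted taps. The function name is hypothetical.

```python
def downsample_3x3(samples):
    """Average each 3x3 block of super-sampled (r, g, b) values into one pixel.

    samples: 2-D list (rows of columns) whose height and width are multiples
    of 3. A uniform box filter is used here as the simplest low-pass filter.
    """
    height, width = len(samples), len(samples[0])
    out = []
    for by in range(0, height, 3):
        row = []
        for bx in range(0, width, 3):
            acc = [0.0, 0.0, 0.0]
            for y in range(by, by + 3):
                for x in range(bx, bx + 3):
                    for c in range(3):
                        acc[c] += samples[y][x][c]
            row.append(tuple(v / 9.0 for v in acc))
        out.append(row)
    return out
```

A single bright sample in an otherwise black 3×3 block contributes one ninth of its value to the displayed pixel, softening the stair-step edges that a single sample per pixel would produce.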

Referring to FIG. 6, one embodiment of a graphical rendering method 600 may be conducted independently of, or in conjunction with, the graphics engine 480. The graphical rendering method 600 may be conducted in hardware, software, or a combination of the two. The graphical rendering method 600 commences with a start step 610 followed by a generate step 620. The generate step 620 provides potentially visible pixels from a descriptor such as the object descriptor 212.

The graphical rendering method 600 proceeds from the generate step 620 to a sort step 630. The sort step 630 sorts pixels such as the potentially visible pixels 512 into a plurality of screen regions. In one embodiment, the sort step 630 sorts using the most significant bits of each pixel's x, y coordinates.

The sort step 630 is followed by a z-buffer region step 640. The z-buffer region step 640 may be conducted in conjunction with the region-sized z-buffer 545. The z-buffer region step 640 retains the pixel with the shallowest depth for each unique x, y coordinate in a screen region. If transparency is being used, more than one pixel per x, y coordinate may be retained and sent on to the colorization engine. The level of transparency for each pixel is preferably known at this point. The z-buffer region step 640 is preferably repeated for each screen region referenced in the sort step 630.

After the z-buffer region step 640, the graphical rendering method 600 proceeds to a sort step 650. Attribute requests are calculated based on the memory location of the texture or other information required to determine the color of each pixel. The sort step 650 sorts multiple attribute requests to increase the locality of memory references, which maximizes the rate at which data is transferred from internal or external memory by minimizing the number of new DRAM page accesses. The sort step 650 is followed by a retrieve step 660, which retrieves the requested pixel attributes.
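To show why sorting reduces new DRAM page accesses, the sketch below counts page activations before and after sorting a handful of requests. It assumes a deliberately simple memory model (one bank, one open page, 2048-byte pages) and hypothetical (address, pixel_id) request tuples; actual DRAM timing is considerably more involved.

```python
def page_opens(addresses, page_size):
    """Count DRAM page activations for an access sequence.

    Simplified model (an assumption): a single bank with one open page, so
    every change of page costs one activation.
    """
    opens, current = 0, None
    for addr in addresses:
        page = addr // page_size
        if page != current:
            opens += 1
            current = page
    return opens

def sort_attribute_requests(requests, page_size):
    """Stable sort of (address, pixel_id) requests by the page they hit."""
    return sorted(requests, key=lambda req: req[0] // page_size)

requests = [(0, "p0"), (4096, "p1"), (8, "p2"), (4100, "p3"), (16, "p4")]
unsorted_cost = page_opens([a for a, _ in requests], 2048)
sorted_reqs = sort_attribute_requests(requests, 2048)
sorted_cost = page_opens([a for a, _ in sorted_reqs], 2048)
```

In this toy model the unsorted sequence opens a page five times and the sorted one only twice. Because Python's sort is stable, requests that share a page keep their original relative order.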

The retrieve step 660 is followed by a colorize step 670 and a filter step 680. The colorize step 670 uses the pixel attributes to color, texture, and shade pixels to provide colorized pixels. The filter step 680 removes aliasing effects by filtering the colorized pixels. The graphical rendering method 600 terminates at an end step 690.

As mentioned, the graphical rendering method 600 may be conducted in conjunction with the graphics engine 480. Specifically, the generate step 620 is preferably conducted by the pixel generators 510 and the occlusion detectors 520. The sort step 630 and the z-buffer region step 640 are preferably conducted in conjunction with the sorting z-buffer 530. The sort step 650, the retrieve step 660, and the colorize step 670 are in one embodiment conducted in conjunction with the colorization engine 550 and the pixel attribute memory 580. Lastly, the filter step 680 is preferably conducted in conjunction with the anti-aliasing filter 570.

FIG. 7 is a schematic block diagram depicting one embodiment of the pixel generators 510 of FIG. 5. As depicted, the pixel generators 510 include a plurality of patch tessilators 710, triangle pixelizers 720, and voxel ray tracers 730. The pixel generators 510 receive the object descriptors 212 and coordinate with the occlusion detectors 520 via an occlusion bus 702 to generate the potentially visible pixels 512.

In one embodiment, the object descriptors 212 received by the patch tessilator 710 describe surface patches such as Bezier patches. The patch tessilator 710 converts the surface patches into triangle descriptors 712. The triangle pixelizers 720 receive the triangle descriptors 712 from the patch tessilator 710, or object descriptors 212 that describe triangles from a module such as the display memory 210. The triangle pixelizers 720 in turn provide the potentially visible pixels 512.

The voxel ray tracers 730 receive the object descriptors 212 that describe or reference voxel objects. Voxel objects are essentially three-dimensional bitmaps that may include surface normal information for each voxel. The voxel ray tracers 730 conduct ray tracing operations that sample voxel objects to provide the potentially visible pixels 512.

The patch tessilators 710 and the triangle pixelizers 720 are exemplary of the architecture of the pixel generators 510. Pixelizers such as the triangle pixelizers 720 receive primitive objects and convert the objects to pixels. The voxel ray tracer 730 is also a pixelizer in that voxels are primitive objects, and the voxel ray tracer 730 provides potentially visible pixels 512. In contrast to pixelizers, converters such as the patch tessilators 710 receive non-primitive objects and convert them to primitive objects that are then processed by pixelizers. Other types of converters and pixelizers may be used within the pixel generators 510.

Table 1 depicts one embodiment of a pixel descriptor used in conjunction with certain embodiments of the present invention. The pixel descriptor may be dependent on the particular type of graphical object 100 that is being processed. For instance, pixel descriptors containing data corresponding to patch objects may differ in structure from pixel descriptors containing data corresponding to voxel objects.

In certain embodiments, the various elements of the graphics engine 480 and the graphical rendering method 600 reference or provide information to the pixel descriptor. For example, in the preferred embodiment, the pixel generators 510 may provide the X, Y location of the pixel within the tile, the Z depth value, the I.D. of the object that generated it, the U, V texture coordinates, and the nX, nY, nZ surface normal values, while the pixel colorizers 555 provide the R, G, and B values. Pixels generated from voxel objects may not utilize all of the fields, such as the surface normal information, which may be looked up after the z-buffering stage. The pixel descriptor is preferably dynamic in that fields are added or deleted as required by the stage of the pipeline working with it.

TABLE 1
Pixel Descriptor
R, G, B
Color Index
X, Y, Z
U, V
nX, nY, nZ
Object ID

In one embodiment, the pixel descriptor is used to represent the potentially visible pixels 512, the visible pixels 532, and the colorized pixels 552. Using a pixel descriptor facilitates a decentralized architecture for the graphics engine 480, such as the flow-thru architecture described in conjunction with FIG. 5. The pixel descriptor shown in Table 1 includes values for the device component colors, such as the Red, Green, and Blue color values shown in conjunction with the rendered color 108 depicted in FIG. 1 a. Also included are a color index for the object color, the X, Y, and Z coordinates for the particular pixel, a pair of texture map coordinates U, V, and surface normal information nX, nY, and nZ.
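Table 1 maps naturally onto a record type. The Python sketch below is one possible representation, assuming that fields not yet supplied by an earlier pipeline stage are left empty (None) rather than physically removed; the class and field names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PixelDescriptor:
    """Fields follow Table 1; which fields are optional is an assumption."""
    x: int
    y: int
    z: float
    object_id: int
    color_index: Optional[int] = None
    uv: Optional[Tuple[float, float]] = None              # texture coords U, V
    normal: Optional[Tuple[float, float, float]] = None   # nX, nY, nZ
    rgb: Optional[Tuple[int, int, int]] = None            # set by a colorizer

# A pixel generator fills position, depth, object ID, texture coordinates,
# and surface normal...
p = PixelDescriptor(x=3, y=4, z=0.5, object_id=7,
                    uv=(0.25, 0.75), normal=(0.0, 0.0, 1.0))
# ...and a later colorizing stage adds the color.
p.rgb = (255, 128, 0)
```

A hardware implementation would more likely vary the descriptor width per stage; the optional fields here only mimic that behavior in software.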

Referring to FIG. 8 a, one embodiment of the triangle pixelizer 720 includes a span generator 810 and a span converter 820. The span generator 810 receives the triangle descriptors 712 or the object descriptors 212 that describe triangles and provides a set of spans 812 that are enclosed by the described triangles. In certain situations, the span generator 810 may not generate any of the spans 812. For example, a triangle viewed edge-on may be too thin, and some triangles may be too small, to enclose any spans 812.

In the depicted embodiment, the span generator 810 provides a pixel set descriptor 514 to the occlusion detector 520. In return, the occlusion detector 520 provides the pixel set mask 522 indicating which pixels within the pixel set are potentially visible. In one embodiment, the span generator 810 ensures, via the occlusion detector 520, that the spans 812 are pixel spans in which no pixels are known to be occluded. If not, the span generator 810 may restrict or subdivide the spans 812 such that no pixels therein are known to be occluded. The span converter 820 receives the spans 812 and converts the spans into individual pixels, i.e., the potentially visible pixels 512.
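One way a span could be restricted or subdivided using a one-bit-per-pixel mask is sketched below: the mask is scanned and each run of set bits becomes a sub-span containing only potentially visible pixels. The function name and the (first_x, last_x) span representation are assumptions.

```python
def split_span(x_start, length, mask):
    """Split a span into sub-spans of potentially visible pixels.

    Bit i of `mask` is 1 when pixel x_start + i is potentially visible.
    Returns a list of inclusive (first_x, last_x) pairs, one per run of 1s.
    """
    spans = []
    run_start = None
    for i in range(length):
        visible = (mask >> i) & 1
        if visible and run_start is None:
            run_start = i                      # a visible run begins
        elif not visible and run_start is not None:
            spans.append((x_start + run_start, x_start + i - 1))
            run_start = None                   # the run ended at pixel i - 1
    if run_start is not None:                  # run extends to the span's end
        spans.append((x_start + run_start, x_start + length - 1))
    return spans
```

For a six-pixel span starting at x = 10 whose two middle pixels are occluded (mask 0b110011), the result is the two sub-spans (10, 11) and (14, 15).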

FIG. 8 b is a flow chart diagram depicting one embodiment of a triangle pixelization method 830 of the present invention. The triangle pixelization method 830 includes a start step 835, a generate spans step 840, a pixelize spans step 850, and an end step 855. The generate spans step 840 converts the object descriptor 212 into the spans 812. In one embodiment, the spans 812 containing pixels that are known to be occluded may be subdivided into spans 812 in which no pixels are known to be occluded.

The pixelize spans step 850 converts the spans 812 into individual pixels to provide the potentially visible pixels 512. The triangle pixelization method 830 may be appropriate for objects other than triangles. The triangle pixelization method 830 may be conducted independently of, or in conjunction with, the triangle pixelizer 720.

FIG. 8 c depicts results typical of the triangle pixelization method 830. An object boundary 860 is defined by connecting a set of object vertices 862. The object boundary 860 encompasses a set of pixels 864 that are within the object boundary. The generate spans step 840 converts the object descriptor 212 into the spans 812. For example, spans may be computed using geometric formulas that calculate the minimum and maximum x values for each pixel scanline using slope information. The minimum and maximum x values correspond to a start pixel and an end pixel of the span 812.
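The minimum and maximum x computation per scanline can be sketched as follows. The sketch assumes that pixel (x, y) covers the unit square centered at (x + 0.5, y + 0.5), that a pixel belongs to a span when its center lies inside or on the triangle, and that vertices are (x, y) pairs; the function `triangle_spans` is hypothetical. It intersects each non-horizontal edge with the scanline through the pixel centers instead of stepping by slope increments, which is equivalent in exact arithmetic.

```python
import math

def triangle_spans(v0, v1, v2):
    """Return {y: (x_start, x_end)} for pixels whose centers lie in the triangle."""
    ys = [v0[1], v1[1], v2[1]]
    spans = {}
    # Scanlines whose pixel centers (y + 0.5) fall within the vertical extent.
    y_lo = math.ceil(min(ys) - 0.5)
    y_hi = math.floor(max(ys) - 0.5)
    for y in range(y_lo, y_hi + 1):
        yc = y + 0.5
        xs = []
        for (xa, ya), (xb, yb) in ((v0, v1), (v1, v2), (v2, v0)):
            if ya == yb:
                continue  # horizontal edges never cross a scanline
            if min(ya, yb) <= yc <= max(ya, yb):
                t = (yc - ya) / (yb - ya)
                xs.append(xa + t * (xb - xa))
        if len(xs) < 2:
            continue
        # Start and end pixels whose centers lie within [min x, max x].
        x_start = math.ceil(min(xs) - 0.5)
        x_end = math.floor(max(xs) - 0.5)
        if x_start <= x_end:
            spans[y] = (x_start, x_end)
    return spans

spans = triangle_spans((0, 0), (4, 0), (0, 4))
# {0: (0, 3), 1: (0, 2), 2: (0, 1), 3: (0, 0)}
```

A sliver triangle that covers no pixel center, such as one with vertices (0, 0), (0.2, 0), and (0, 0.2), yields no spans at all, matching the case noted for FIG. 8 a.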

Referring now to FIG. 9, one embodiment of a ray tracing apparatus 900 includes a bundle caster 910, a proximity tester 920, a ray caster 930, and a collision tester 940. The ray tracing apparatus 900 may be used to embody the voxel ray tracers 730 of FIG. 7. The bundle caster 910 receives the object descriptor 212 and provides one or more proximate rays 912. The ray caster 930 receives the proximate rays 912 and provides the potentially visible pixels 512.

The bundle caster 910 recursively advances a position 914 of a ray bundle. The proximity tester 920 receives the position 914 and returns a hit signal 922 if the position 914 is proximate to an object of interest or a portion thereof, such as individual voxels. In one embodiment, the object of interest is a voxel object, the position 914 advances a distance that corresponds to a proximity distance used by the proximity tester 920, and the recursive advancement of the position 914 terminates upon assertion of the hit signal 922. The ray bundle that is advanced by the bundle caster 910 corresponds to a screen area or region within the graphical scene 150.

In the depicted embodiment, the bundle caster 910 provides an individual ray 912 to the ray caster 930. The ray caster 930 recursively advances a position 932 of an individual ray. The collision tester 940 receives the position 932 and returns a hit signal 942 if the position 932 impinges upon an object of interest. In one embodiment, the object of interest is a voxel object, and the recursive advancement of the position 932 terminates upon assertion of the hit signal 942.

In the depicted embodiment, the bundle caster 910 and the ray caster 930 communicate with the occlusion detector 520 via the occlusion bus 702, which in one embodiment carries the pixel set descriptor 514 and the pixel set mask 522. The position 914 that is advanced by the bundle caster 910 and the position 932 that is advanced by the ray caster 930 each have a depth component that corresponds to a pixel depth within the graphical scene 150.

The bundle caster 910 and the ray caster 930 provide information to one or more occlusion detectors sufficient to ascertain which rays have a pixel depth greater than the current occlusion depth. The pixels that are potentially visible are provided by the ray caster 930 as the potentially visible pixels 512.

In one embodiment, the ray caster 930 informs the occlusion detector 520 via the occlusion bus 702 regarding the depth at which occlusion occurs, i.e., the depth at which an object of interest is impinged. In the preferred embodiment, the occlusion detector 520 uses the depth information to ascertain the occluded pixels and to update the current occlusion depth for each pixel position within the pixel set.
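Updating the current occlusion depth from the depths reported by a ray caster can be sketched as a per-position minimum. The sketch assumes opaque objects, smaller depth meaning nearer, and a hypothetical list of hit depths in which None marks a ray that hit nothing.

```python
def update_occlusion(occlusion_depth, x_start, hit_depths):
    """Lower the stored occlusion depth wherever a ray reported a nearer hit.

    hit_depths[i] is the depth at which the ray for pixel x_start + i impinged
    on an opaque object, or None if that ray hit nothing.
    """
    for i, z in enumerate(hit_depths):
        x = x_start + i
        if z is not None and z < occlusion_depth[x]:
            occlusion_depth[x] = z
    return occlusion_depth
```

Later pixel sets that lie behind the updated depths then receive cleared mask bits, so the pixel generators skip them.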

Referring to FIG. 10 a, one embodiment of the proximity tester 920 includes a mask index calculator 1010, a proximity mask cache 1020, and an external memory 1030. The caching architecture of the proximity tester 920 reduces the required size of local storage such as on-chip memory. The caching architecture also facilitates the use of slower non-local memory, such as off-chip memory, and lowers the access bandwidth required of the non-local memory, since only the data likely to be used need be brought on-chip.

The mask index calculator 1010 receives the position 914 and computes an index 1012 corresponding to the position 914. The proximity mask cache 1020 contains bit fields indicating the positions that are proximate to or within an object of interest. The indexed mask bit is preferably within the proximity mask cache 1020 and is used to provide the hit signal 922. If the mask bit corresponding to the index 1012 is not within the proximity mask cache 1020, the proper mask bit is retrieved via the external memory 1030.
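The following sketch models a cached proximity-mask lookup in software under stated assumptions: the proximity mask is a flat list of bits for an nx by ny by nz grid, the mask index is computed in x-fastest order, the cache holds a small number of fixed-size lines with least-recently-used replacement, and a Python list stands in for the external memory. The class and its parameters are hypothetical; the description does not specify the index layout or replacement policy.

```python
from collections import OrderedDict

class ProximityTester:
    """Software model of a cached proximity-mask lookup (hypothetical)."""

    def __init__(self, external_bits, dims, line_bits=64, cache_lines=4):
        self.external = external_bits      # stands in for off-chip memory
        self.nx, self.ny, self.nz = dims
        self.line_bits = line_bits         # mask bits per cache line
        self.capacity = cache_lines        # number of lines held on chip
        self.cache = OrderedDict()         # line number -> list of mask bits
        self.misses = 0

    def index(self, pos):
        """Mask index for an (x, y, z) position, x varying fastest."""
        x, y, z = (int(c) for c in pos)
        return (z * self.ny + y) * self.nx + x

    def hit(self, pos):
        """Return True if the mask marks pos as proximate to an object."""
        line, offset = divmod(self.index(pos), self.line_bits)
        if line in self.cache:
            self.cache.move_to_end(line)   # mark as most recently used
        else:
            self.misses += 1               # fetch the line from external memory
            start = line * self.line_bits
            self.cache[line] = self.external[start:start + self.line_bits]
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return bool(self.cache[line][offset])
```

Neighboring positions along a cast tend to fall in the same cache line, so consecutive proximity tests usually hit on chip and only an occasional line is fetched externally.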

Referring to FIG. 10 b, one embodiment of a collision tester 940 includes a subblock index calculator 1040, a subblock register 1050, a subblock cache 1060, and an external memory 1070. The collision tester 940 partitions collision bits, indicating the positions in rendering space that an object of interest occupies, into three-dimensional subblocks such as a 4×4×4 grid of collision bits.

To increase the hit rate within the subblock cache 1060 and to facilitate efficient memory transfers, the various functional units of the collision tester 940 operate on a subblock basis using a subblock 1062. The use of subblocks and a subblock cache within the collision tester 940 facilitates the use of slower non-local memory, such as off-chip memory, and lowers the access bandwidth required of the non-local memory. Subblocks also reduce the required size of local storage such as on-chip memory. In the preferred embodiment, the use of subblocks and the subblock cache 1060 within the collision tester 940 allows the mask tests to be conducted very quickly, since the subblock in use is stored locally to the ray caster.

The subblock index calculator 1040 receives the position 932 and computes a subblock index 1042 as well as a bit index 1044. The subblock index 1042 is received by and used to access the subblock cache 1060. If the referenced subblock 1062 is within the cache, it is provided to the subblock register 1050. If not, the referenced subblock 1062 is retrieved from the external memory 1070 and is provided to the subblock register 1050. The bit index 1044 is used to address specific collision bits within the subblock register 1050 and to provide the hit signal 942.
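For 4×4×4 subblocks, a subblock index and a bit index can be formed from the high-order and low-order bits of each integer coordinate, as sketched below. The x-fastest ordering of both indices, the representation of a subblock as a 64-bit integer, and the function names are assumptions made for illustration.

```python
def subblock_indices(x, y, z, blocks_x, blocks_y):
    """Split integer voxel coordinates into (subblock_index, bit_index).

    The high-order bits (coordinate >> 2) select a 4x4x4 subblock in a grid
    that is blocks_x by blocks_y subblocks across; the low two bits of each
    coordinate select one of the 64 collision bits inside it.
    """
    bx, by, bz = x >> 2, y >> 2, z >> 2
    subblock_index = (bz * blocks_y + by) * blocks_x + bx
    bit_index = ((z & 3) << 4) | ((y & 3) << 2) | (x & 3)
    return subblock_index, bit_index

def collision_hit(subblock_bits, bit_index):
    """Test one collision bit of a subblock held as a 64-bit integer."""
    return ((subblock_bits >> bit_index) & 1) == 1
```

With power-of-two subblocks both indices reduce to shifts and masks, so the calculator needs no division hardware.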

Referring to FIG. 11, one embodiment of a caster 1100 includes a set of register files 1110 and a set of ALUs 1120 to compute the x, y, z, and depth coordinates of a ray or ray bundle. The caster 1100 may be used to embody the bundle caster 910 and/or the ray caster 930. The architecture of the caster 1100 facilitates using a wide variety of algorithms when conducting casting. The caster 1100 is particularly well suited to conducting vector-based casting algorithms.

The register files 1110 contain variables used in casting such as position, casting distance, vectors in the view direction, sideways vectors in the down and right directions, and the like. A register bus 1112 provides the contents of the registers within the register file 1110 to a scalar multiplier 1130 and one port of the ALU 1120. The ALU 1120 conducts standard arithmetic functions such as addition and multiplication and provides the results to a results bus 1122.

The scalar multiplier 1130 receives the contents of the register bus 1112 and provides a scaled result 1132 to the other port of the ALU 1120. The scalar multiplier may be used to reference individual rays or subbundles within a ray bundle, to translate or side-step their positions by multiplying a ray offset by a scalar value, and to add the result to a ray position. In one embodiment, the caster 1100 is a ray caster requiring no ray translation, and the scalar multiplier 1130 is simply a pass-through register.

Referring to FIG. 12, one embodiment of a ray casting method 1200 of the present invention encompasses both bundle casting and individual ray casting. The ray casting method 1200 may be conducted in conjunction with or independent of the bundle caster 910, the ray caster 930, and the caster 1100. The ray casting method 1200 commences with a start step 1205 followed by a provide step 1210. The provide step 1210 provides a ray bundle, which in one embodiment requires initializing a position vector at the focal point 114 in a direction determined by the perspective viewer 106.

The ray casting method 1200 proceeds from the provide step 1210 to a proximity test 1215. The proximity test 1215 ascertains whether the ray bundle is proximate to an object of interest. In one embodiment, the proximity test comprises accessing a mask array in conjunction with the proximity tester 920 shown in FIG. 10 a and referenced in FIG. 9. In another embodiment, the proximity test comprises accessing a distance array or grid that indicates the shortest distance from each x,y,z position to the graphical object 100.

If the proximity test 1215 is false, the ray casting method 1200 proceeds to an advance bundle step 1220. The advance bundle step 1220 adds a first casting distance to the ray bundle position. In certain embodiments, the advance bundle step 1220 is followed by an occlusion test 1225, which in one embodiment is conducted by the occlusion detector 520.

The occlusion test 1225 ascertains whether the entire ray bundle is known to be occluded by other objects. If so, the ray casting method 1200 terminates at an end step 1230. Otherwise, the method loops to the proximity test 1215. In certain embodiments, for instance when an apparatus has ample casting resources and scarce occlusion testing resources, the occlusion test 1225 is not conducted on every casting loop of the ray casting method 1200.

If the proximity test 1215 is true, the ray casting method 1200 proceeds to a subdivide step 1235. The subdivide step 1235 divides the ray bundle into subbundles and continues by processing each subbundle. Subdividing requires computing a horizontal and vertical offset (i.e., a subbundle offset) and adding it to the position of the bundle being subdivided. In those instances involving perspective rendering, subdividing also requires computing a new directional vector. In the preferred embodiment, computing and adding the horizontal and vertical offset is conducted in conjunction with the scalar multiplier 1130 and the ALU 1120.

In certain embodiments, the subdivide step 1235 retreats or advances the ray bundle a second casting distance to ensure proper proximity testing, facilitate longer casting distances, and reduce the average number of proximity tests. In one embodiment, the subdivide step retreats a second casting distance, and the average number of proximity and collision tests per ray intersection on typical data was found to be less than eight.

In one embodiment, the subdivide step 1235 comprises activating subdivided, or child, bundles while continuing to conduct casting of the current (parent) bundle. Continuing to conduct casting requires proceeding to the advance bundle step 1220 even when the proximity test 1215 is true. Continued casting of the parent bundle is useful when some rays may not collide with the object(s) whose proximity is being tested. Continued casting facilitates termination of the child bundles (i.e., rebundling of the children into the parent) when the proximity test 1215 is once again false, thus reducing the required number of proximity tests.

The subdivide step 1235 is followed by a single ray test 1240, which ascertains whether the subdivided bundle contains a single ray. If not, the ray casting method 1200 loops to the proximity test 1215. Otherwise, the method 1200 proceeds to a collision test 1245. The collision test 1245 ascertains whether the individual ray has collided with an object of interest such as the graphical object 100. In one embodiment, the collision test comprises accessing a mask array in conjunction with the collision tester 940 shown in FIG. 10 a and referenced in FIG. 9. If the collision test 1245 is false, the ray casting method 1200 proceeds to an advance ray step 1250.

In one embodiment, the advance ray step 1250 adds a first casting distance to the individual ray position. In another embodiment, the advance ray step 1250 computes the distance to the next intersected voxel of a voxel object and advances by that distance. In certain embodiments, the advance ray step 1250 is followed by an occlusion test 1255, which in one embodiment is conducted by the occlusion detector 520. In certain embodiments, the occlusion test 1255 is preferably conducted in conjunction with the subdivide step 1235.

The occlusion test 1255 ascertains whether the individual ray is known to be occluded by other objects. If so, the ray casting method 1200 terminates at an end step 1260; otherwise, the method 1200 loops to the collision test 1245. In certain embodiments, the occlusion test 1255 is not conducted on every loop of the advance ray step 1250.

The best placement and frequency of the occlusion tests 1225 and 1255 within the ray casting method 1200 may be application-dependent. In particular, the frequency of testing may be adjusted in response to resource availability, such as processing cycles within the occlusion detector 520. In certain embodiments, the occlusion tests 1225 and 1255 are preferably conducted in conjunction with the provide step 1210 and the subdivide step 1235 rather than after the advance bundle step 1220 and the advance ray step 1250.
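The control flow of the ray casting method 1200 can be sketched on a two-dimensional occupancy grid. This is a simplified illustration under several assumptions: rays travel along +x, a bundle covers `size` adjacent rows, the casting distance equals the bundle width, bundles split in halves, the proximity test scans the grid directly rather than reading a precomputed proximity mask, and occlusion tests are omitted. A stack stands in for the loops of the flow chart.

```python
from typing import Dict, List

Grid = List[List[bool]]  # solid[row][column]


def proximate(solid: Grid, y: int, size: int, x: int) -> bool:
    """Proximity test: is any solid cell in the area the bundle would sweep
    on its next advance (rows y..y+size-1, columns x..x+size-1)?"""
    width = len(solid[0])
    return any(solid[r][c]
               for r in range(y, y + size)
               for c in range(x, min(x + size, width)))


def cast_bundle(solid: Grid, y0: int, size: int) -> Dict[int, int]:
    """Cast `size` parallel rays (rows y0..y0+size-1) along +x and return
    {row: column of the first solid cell} for each ray that collides."""
    width = len(solid[0])
    hits: Dict[int, int] = {}
    stack = [(y0, size, 0)]                    # provide step
    while stack:
        y, s, x = stack.pop()
        if s == 1:                             # single ray test is true
            while x < width:
                if solid[y][x]:                # collision test
                    hits[y] = x
                    break
                x += 1                         # advance ray step
            continue
        while x < width and not proximate(solid, y, s, x):
            x += s                             # advance bundle step (skips only empty cells)
        if x >= width:
            continue                           # bundle left the scene without a hit
        half = s // 2                          # subdivide step
        stack.append((y, half, x))
        stack.append((y + half, s - half, x))
    return hits
```

Because a bundle advances only across an area the proximity test has shown to be empty, each ray still reports its true first hit, while empty space is crossed in large steps.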

FIG. 13 a is a flow chart diagram depicting one embodiment of a proximity mask generation method 1300 in accordance with the present invention. The generated proximity mask and associated collision mask are preferably used in conjunction with the ray casting method 1200. FIGS. 13 b through 13 g are a series of two-dimensional illustrations depicting examples of the results of the proximity mask generation method 1300. The illustrations are presented to enable one of ordinary skill in the art to make and use the invention.

The graphical object 100 shown in FIG. 13 b may be a voxel object comprised of three-dimensional cubes, or voxels. For simplicity, a profile view was selected to restrict the illustration to two dimensions. A voxel object is essentially a three-dimensional bitmap wherein each cell or cube is assigned a color or texture along with a surface normal to indicate the directionality of the surface.

After starting 1310, the proximity mask generation method 1300 proceeds by converting 1320 the graphical object 100 to a collision mask 1322 at the highest resolution available. Converting a voxel object to a collision mask involves storing a single bit for each voxel or cell, preferably in a compressed format.

After creating the collision mask 1322, the proximity mask generation method 1300 proceeds by horizontally copying 1330 the collision mask 1322 in each horizontal direction to create a horizontally expanded mask 1332 shown in FIG. 13 d. The horizontal copying 1330 is followed by vertically copying 1340 the horizontally expanded mask 1332 in each vertical direction to create a vertically expanded mask 1342 shown in FIG. 13 e. In one embodiment, horizontal and vertical copying involves a shift operation followed by a bitwise OR operation.

The result of horizontal and vertical expansion is the proximity mask 1344 shown in FIG. 13 f. In the depicted illustrations, the amount of horizontal and vertical expansion is two voxels, and the proximity mask 1344 indicates a proximity of two voxels. After horizontal and vertical expansion, the proximity mask generation method 1300 optionally, and preferably, continues by reducing 1350 the resolution of the proximity mask 1344 to produce a lower resolution proximity mask 1352 shown in FIG. 13 g. In the depicted embodiment, reducing 1350 comprises ORing proximity mask data from 2×2×2 grids of adjacent cells into the larger (lower resolution) cells of the lower resolution proximity mask 1352. The proximity mask generation method 1300 then terminates 1360.
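The shift-and-OR expansion and the OR-based resolution reduction can be sketched for the two-dimensional case shown in the illustrations. In this sketch each mask row is held in a Python integer (bit c set means column c is occupied); that layout, the function names, and the 2×2 reduction (the two-dimensional analogue of the 2×2×2 reduction described above) are illustrative assumptions.

```python
from typing import List

Mask = List[int]  # one int per row; bit c set means column c is occupied


def expand_horizontal(mask: Mask, d: int, width: int) -> Mask:
    """Horizontal copying: OR each row with copies shifted 1..d cells left and right."""
    full = (1 << width) - 1
    out = []
    for row in mask:
        grown = row
        for k in range(1, d + 1):
            grown |= (row << k) | (row >> k)
        out.append(grown & full)          # drop bits shifted past the right edge
    return out


def expand_vertical(mask: Mask, d: int) -> Mask:
    """Vertical copying: OR each row with the rows up to d above and below it."""
    h = len(mask)
    out = []
    for y in range(h):
        acc = 0
        for yy in range(max(0, y - d), min(h, y + d + 1)):
            acc |= mask[yy]
        out.append(acc)
    return out


def reduce_2x2(mask: Mask, width: int) -> Mask:
    """Halve the resolution: a coarse cell is set if any of its 2x2 fine cells is set."""
    out = []
    for y in range(0, len(mask), 2):
        pair = mask[y] | (mask[y + 1] if y + 1 < len(mask) else 0)
        coarse = 0
        for c in range(0, width, 2):
            if (pair >> c) & 0b11:
                coarse |= 1 << (c // 2)
        out.append(coarse)
    return out
```

For a single occupied cell at row 3, column 3 of an 8×8 collision mask, expanding by two voxels sets columns 1 through 5 on rows 1 through 5 (row value `0b111110`), and the 2×2 reduction yields `[0b111, 0b111, 0b111, 0]`.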

FIG. 14 is an illustration depicting the operation of one embodiment of the ray casting method 1200 in conjunction with several proximity masks and a collision mask. The illustration of FIG. 14 is intended to be a non-rigorous depiction sufficient to communicate the intent of the invention. In the depicted operation, the object of interest is a chair.

A parent bundle 1410 with an initial position 1412 is tested against a first proximity mask 1420. The proximity test is false, resulting in the parent bundle 1410 being cast a first casting distance 1430. The first casting distance 1430 preferably corresponds with the resolution of the first proximity mask 1420 such that visible objects will not be skipped.

In the depicted operation, the parent bundle 1410 advances to a second position 1414, whereupon another proximity test is conducted. The proximity test at the second position 1414 yields a false result, causing the parent bundle 1410 to advance to a third position 1416. As depicted, the proximity test at the third position 1416 is true, resulting in subdividing of the parent bundle 1410 into child bundles 1440.

In the depicted operation, the process of testing and subdividing is repeated for a second proximity mask 1422 using a second casting distance 1432, a third proximity mask 1424 using a third casting distance, and so forth, until the bundles are subdivided into individual rays. The individual rays are then tested against a collision mask 1450, where a true result indicates impingement upon a potentially visible object. During the advancement of the ray bundles and individual rays, occlusion tests may be conducted to ascertain whether the object of interest is occluded by other graphical objects at the current position of the ray bundle or individual ray.

FIGS. 15 and 16 are illustrations depicting the operation of the ray casting method 1200 of the present invention. Referring to FIG. 15 a, a ray bundle 1510 comprises individual rays 1511 and occupies a volume 1512 in rendering space. In the depicted embodiment, the volume 1512 is a cube with a width 1514, a height 1516, and a length 1518. An object of interest 1520 is subject to proximity tests of various distances. Successful casting requires choosing a selected proximity 1530 that ensures the object of interest 1520 is not skipped when within the graphical scene 150 and that a casting distance 1535 is not unnecessarily short. In one embodiment, the selected proximity 1530 corresponds to an enlarged object of interest 1520 a.

Proper proximity testing requires that the selected proximity 1530, i.e., the amount of enlargement used in creating a proximity mask, be at least as great as a distance 1540 from a testing position 1550 to the furthest point within the volume 1512. That is, the selected proximity 1530 must be greater than or equal to the distance 1540. To minimize the distance 1540, the testing position 1550 is preferably in the center of the volume 1512.

Referring to FIG. 16, a ray bundle 1610 may be comprised of diverging rays 1612 that originate from the focal point 114 of the perspective viewer 106 shown in FIG. 1 a. With diverging rays, the volume 1512 increases with each successive cast due to the increase in the width 1514 and the height 1516. In one embodiment, proper proximity testing is maintained by recalculating the distance 1540 and selecting a proximity mask with an object enlargement that is greater than or equal to the distance 1540.
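The proximity requirement can be made concrete for a box-shaped bundle volume tested from its center: the furthest point is a corner, so the distance 1540 is half the box diagonal. The sketch below computes that distance and picks the least-enlarged safe proximity mask; the list of available enlargements and the function names are hypothetical, and recomputing the distance as the width and height grow models the diverging-ray case.

```python
import math
from typing import Sequence


def required_proximity(width: float, height: float, length: float) -> float:
    """Distance from the center of a box-shaped bundle volume to its furthest
    point (a corner); the selected proximity must be at least this large."""
    return math.sqrt((width / 2) ** 2 + (height / 2) ** 2 + (length / 2) ** 2)


def select_mask(enlargements: Sequence[float], width: float, height: float,
                length: float) -> int:
    """Index of the least-enlarged proximity mask that is still safe.
    `enlargements` holds each available mask's enlargement (hypothetical layout)."""
    need = required_proximity(width, height, length)
    candidates = [i for i, e in enumerate(enlargements) if e >= need]
    if not candidates:
        raise ValueError("no proximity mask is enlarged enough; subdivide the bundle")
    return min(candidates, key=lambda i: enlargements[i])
```

A 2×2×2 volume needs a proximity of at least √3 ≈ 1.73, so with enlargements of 1, 2, 4, and 8 the mask at index 1 is chosen. If divergence grows the width and height to 6, the required proximity rises to √19 ≈ 4.36 and the mask at index 3 must be used.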

Referring to FIG. 17 a, one embodiment of the occlusion detector 520 of FIG. 5 includes a coarse z-buffer 1710, a comparator 1720, and a register 1730. In one embodiment, the coarse z-buffer 1710 is essentially a specialized memory containing the shallowest known pixel depth for each pixel position in the graphical scene 150. The shallowest known depth is the shallowest depth encountered at each pixel position for the pixels that have already been processed by the occlusion detector 520. The shallowest known pixel depth is referred to herein as the current occlusion depth.

A data bus 1712 carries the depth information that is stored within the coarse z-buffer 1710. In one embodiment, the data bus 1712 is a parallel bus capable of accessing an entire row of depth information within the coarse z-buffer 1710. In another embodiment, the data bus 1712 (and the pixel set mask 522) has a convenient width, such as 32 bits, and multiple accesses must be conducted to access an entire row of depth information. The entire row of depth information preferably corresponds to a row of pixels within the graphical scene 150. The depth information is preferably coarse, i.e., of reduced resolution, since complete pixel pruning is not required of the occlusion detector 520.

Using coarse depth information (i.e., a reduced number of bits to represent the depth) facilitates pruning the majority of occluded pixels while using a relatively small memory as the coarse z-buffer 1710. In one embodiment, the coarse z-buffer 1710 is used in conjunction with depth shifting, in which graphical rendering is localized to a specific depth range and the display lists are sorted in depth (front-to-back) order to facilitate depth localization.

Depth shifting, or depth localization, is a method developed in conjunction with the present invention to maximize the usefulness of the coarse z-buffer. Depth shifting comprises shifting a depth range during the rendering process, thereby focusing the resolution of the coarse z-buffer on a particular range of z values. In the preferred embodiment, a current minimum depth is maintained along with a current coarseness (for example, a multiplier or exponent) indicating the resolution of the z values stored within the coarse z-buffer. Depth shifting is preferably conducted in conjunction with depth-ordered rendering, and the current coarseness is adjusted to match the density of primitives being rendered at the current depth.

In one embodiment, depth shifting comprises subtracting an offset from each z value within the z-buffer, with values below zero being set to zero. In another embodiment, depth shifting comprises subtracting an offset as well as bit shifting each of the z values to change the current coarseness of the values contained in the coarse z-buffer. In yet another embodiment, depth shifting comprises adding an offset to the values in the coarse z-buffer and setting overflowed depths to a maximum value and underflowed depths to a minimum value. In the presently preferred embodiment, the maximum z value represented in the coarse z-buffer indicates a location containing no pixel data, while the minimum value of zero represents a pixel generated at a shallower depth than the current minimum depth.
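The depth-shifting embodiments can be combined into one small routine. This sketch assumes 8-bit coarse z entries, with the maximum code reserved for "no pixel data" as in the presently preferred embodiment. Underflowed depths clamp to zero, an optional right shift makes the representation coarser, and overflowed depths saturate to the maximum value; saturating to the "no pixel" code is a conservative choice, since such a location can no longer cause pixels to be pruned. The constant and function names are hypothetical.

```python
from typing import List

Z_BITS = 8                     # assumed width of a coarse z-buffer entry
EMPTY = (1 << Z_BITS) - 1      # maximum code: location holds no pixel data


def depth_shift(zbuf: List[int], offset: int, extra_shift: int = 0) -> List[int]:
    """Re-base coarse depths when the current minimum depth moves by `offset`
    (in current coarse units), optionally making them `extra_shift` bits coarser."""
    out = []
    for z in zbuf:
        if z == EMPTY:
            out.append(EMPTY)          # empty locations stay empty
            continue
        z -= offset
        if z < 0:
            z = 0                      # underflow: shallower than the new minimum depth
        z >>= extra_shift              # coarser resolution for the new depth range
        out.append(min(z, EMPTY))      # overflow: saturate to the maximum value
    return out
```

For example, `depth_shift([10, 3, 255, 200], 5)` gives `[5, 0, 255, 195]`, and with `extra_shift=1` it gives `[2, 0, 255, 97]`.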

The register 1730 receives a pixel set descriptor 514 including depth information. In one embodiment, the pixel set descriptor 514 describes a horizontal span of consecutive pixels. The register 1730 provides the pixel set descriptor to the comparator 1720.

The comparator 1720 compares the minimum depth for the pixel set with each pixel's occlusion depth by accessing the occlusion depth for each pixel within the pixel set via the data bus 1712. The comparator 1720 provides the pixel set mask 522 indicating which pixels within the pixel set are known to be occluded. In the preferred embodiment, the comparator 1720 also compares the maximum depth for the pixel set with each pixel's occlusion depth and updates the contents of the coarse z-buffer 1710 if the maximum depth is shallower than the current occlusion depth.

Referring to FIG. 17 b, one embodiment of an occlusion detection method 1740 may be conducted in conjunction with the generate step 620 of the graphical rendering method 600 of the present invention. The occlusion detection method 1740 may also be conducted in conjunction with the occlusion detector 520. In the preferred embodiment, the occlusion detection method 1740 is used to conduct gated pixelization, such that pixels that are known to be occluded are not included in subsequent rendering stages.

The occlusion detection method 1740 begins with a start step 1750 followed by a receive step 1755. The receive step 1755 receives a pixel set descriptor, such as the pixel set descriptor 514, that describes the extents of the pixel set being processed in conjunction with a graphical object such as the graphical object 100. The pixel set descriptor preferably includes depth information such as maximum and minimum depth. In one embodiment, the pixel set descriptor enumerates the starting and ending pixels of a span along with minimum and maximum depths.

The occlusion detection method 1740 facilitates specifying a depth range rather than requiring exact depth information for each pixel in the pixel set of interest. In most cases, a depth range comprising minimum and maximum depths is sufficient to prune a majority of non-visible pixels and to update the occlusion depth. While the occlusion detection method 1740 may be used in a single pixel mode that specifies an exact pixel depth, the preferred embodiment comprises specifying a depth range for an entire set of pixels. Specifying a depth range for an entire set of pixels reduces the data bandwidth required to conduct occlusion detection.

The occlusion detection method 1740 proceeds from the receive step 1755 to a retrieve step 1760. The retrieve step 1760 retrieves the occlusion depth for the locations described by the pixel set descriptor. In one embodiment, the retrieve step 1760 is conducted by the comparator 1720 in conjunction with the coarse z-buffer 1710.

After the retrieve step 1760, the occlusion detection method 1740 conducts a minimum depth test 1770 on each pixel in the described pixel set. The minimum depth test 1770 ascertains whether the occlusion depth for a particular pixel location is less than the pixel set minimum. If so, a set flag step 1775 is conducted. Otherwise, a maximum depth test 1780 is conducted. The set flag step 1775 sets a flag for each pixel that passes the minimum depth test 1770. The pixels that pass the minimum depth test 1770 are known to be occluded (a previously processed pixel at that location is shallower than any part of the pixel set), while the remaining pixels are potentially visible.

If the minimum depth test 1770 is false for some or all of the pixels in the pixel set of interest, the maximum depth test 1780 is conducted, preferably only on those pixels that fail the minimum depth test 1770. The maximum depth test 1780 ascertains whether the occlusion depth for a particular pixel location is greater than the pixel set maximum. If so, the particular pixel is shallower than the occlusion depth, and an update step 1785 is conducted to update the occlusion depth.

The maximum depth test 1780 and the update step 1785 ensure that the occlusion depth is only decreased, and never increased, while processing a graphical scene or frame. Successful occlusion depth updates are contingent on the maximum depth being valid for the entire set of pixels being considered. In those situations where it is not known whether the graphical object occludes the entire set, such as in certain embodiments of the ray casting method 1200, occlusion depth updates may be deferred until an actual ray collision occurs, thereby removing uncertainty and possible erroneous updates. After the update step 1785, the occlusion detection method 1740 loops to the receive step 1755 to process other objects and pixel sets.
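The per-span logic of the occlusion detection method 1740 can be sketched as follows. The `PixelSet` layout (one row, inclusive start and end columns, minimum and maximum depth) is a hypothetical encoding of the span-style pixel set descriptor described above, and the coarse z-buffer is modeled as a list of rows of integer depths, smaller meaning shallower.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PixelSet:
    """Hypothetical span descriptor: pixels x_start..x_end (inclusive) on one
    row, with the minimum and maximum depth of the object over that span."""
    row: int
    x_start: int
    x_end: int
    z_min: int
    z_max: int


def occlusion_detect(coarse_z: List[List[int]], ps: PixelSet) -> List[bool]:
    """Return a pixel set mask (True = known to be occluded) and lower the
    occlusion depth wherever the whole span is shallower than it."""
    row = coarse_z[ps.row]                     # retrieve step
    mask = []
    for x in range(ps.x_start, ps.x_end + 1):
        occ = row[x]
        if occ < ps.z_min:                     # minimum depth test
            mask.append(True)                  # set flag step: known occluded
            continue
        mask.append(False)                     # potentially visible
        if occ > ps.z_max:                     # maximum depth test
            row[x] = ps.z_max                  # update step: depth only decreases
    return mask
```

With occlusion depths `[100, 20, 50, 100]` and a span with depths 30 to 40, only the second pixel is flagged as occluded, and the other three occlusion depths are lowered to 40. A location whose occlusion depth falls between the span minimum and maximum is neither flagged nor updated.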

Bucket sorting is an efficient method of sorting data elements that uses a data key, or a portion thereof, to index into a set of buckets, followed by placement of the data elements within the indexed buckets. Sorting postal mail by zip code is an example of the concept of bucket sorting. Bucket sorting is preferably conducted on a coarse basis to reduce the number of buckets to a manageable level. Multiple passes may be conducted to achieve finer sorting.
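One way to obtain fine sorting from several coarse bucket-sorting passes is shown below: each pass indexes a small number of buckets with a few bits of the key, and passes are run from the least significant digit upward, which makes the whole an LSD radix sort. This ordering of passes is an illustrative choice; recursive, most-significant-first passes are another possibility. The function names and the 4-bit digit width are assumptions.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def bucket_pass(items: Sequence[T], key: Callable[[T], int], shift: int,
                bits: int) -> List[T]:
    """One stable, coarse bucket-sorting pass using `bits` bits of the key."""
    buckets: List[List[T]] = [[] for _ in range(1 << bits)]
    for item in items:
        buckets[(key(item) >> shift) & ((1 << bits) - 1)].append(item)
    return [item for bucket in buckets for item in bucket]


def multipass_bucket_sort(items: Sequence[T], key: Callable[[T], int],
                          key_bits: int, bits_per_pass: int = 4) -> List[T]:
    """Sort by a non-negative integer key using several coarse passes,
    least significant digit first (an LSD radix sort)."""
    out = list(items)
    for shift in range(0, key_bits, bits_per_pass):
        out = bucket_pass(out, key, shift, bits_per_pass)
    return out
```

With 8-bit keys and 4-bit digits, each pass needs only 16 buckets, yet two passes fully sort the keys, and elements with equal keys keep their original order.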

Referring to FIG. 18 a, one embodiment of a bucket sorter 1800 includes a memory array 1810 comprised of multiple array columns 1820. The array columns 1820 each send and receive data via a column bus 1822 to and from a memory buffer 1830. The memory buffers 1830 are also connected to a bi-directional memory bus 1840.

The memory bus 1840 provides an interface to a set of bucket buffers 1850. In the depicted embodiment, some of the bucket buffers 1850 are bucket write buffers 1850 a, while others are bucket read buffers 1850 b. The bucket write buffers 1850 a receive data and control information from a bucket controller 1860 via a set of sorter input ports 1852 a. The bucket read buffers 1850 b receive control information from, and provide data to, the bucket controller 1860 through a set of sorter output ports 1852 b.

The bucket buffers 1850 are essentially cache memory for the memory array 1810 under the intelligent control of the bucket controller 1860. The bucket controller 1860 orchestrates the movement of data within the bucket sorter 1800 to effect sorting operations. The architecture of the bucket sorter 1800 facilitates sorting data that is already within the memory array 1810. In certain embodiments, multiple sorting passes may be conducted on data within the memory array 1810. In one embodiment, one or more of the bucket write buffers 1850 a is a miscellaneous bucket that is resorted after the initial sort. The bucket controller 1860 receives and provides bucket data externally through a set of bucket ports 1862 that, in the depicted embodiment, are partitioned into bucket write ports 1862 a and bucket read ports 1862 b.

In one embodiment, the bucket controller 1860 assigns bucket IDs to each bucket buffer, transfers filled bucket write buffers 1850 a to the memory array 1810 via a memory buffer 1830, and fills empty bucket read buffers 1850 b in like fashion. The memory bus 1840, the memory buffer 1830, the column bus 1822, and the array columns 1820 are preferably wide enough to transfer an entire bucket buffer in one bus cycle.

The bucket controller 1860 is preferably equipped with a mechanism to track the placement of bucket data within the memory array 1810. In one embodiment, the tracking mechanism references a memory assignment table, while in another embodiment the tracking mechanism manages a set of linked lists. The bucket controller 1860 may dedicate particular bucket buffers 1850 to store tracking data. The bucket controller 1860 may also store tracking data within the memory array 1810. The components of the bucket sorter 1800 may be partitioned into a memory 1800 a and a sorter 1800 b.

FIG. 18 b shows additional detail of specific elements related to an on-chip embodiment of the bucket sorter 1800. The depicted embodiment is configured to utilize embedded DRAM with wide data paths to increase available bandwidth and bucket sorting performance. In the depicted embodiment, each memory buffer 1830 includes multiple sense amps 1830 a, one or more transfer registers 1830 b, and a data selector 1830 c. In one embodiment, the data selectors comprise a multiplexer.

The depicted bucket buffers 1850 comprise an N-bit interface to a bucket bus 1852 and an M×N-bit interface to the memory bus 1840. In the depicted embodiment, each of the K bucket buffers 1850 may transfer data to and from the bi-directional memory bus 1840. In the preferred embodiment, the bits of the bucket buffer are interleaved to facilitate bit alignment and to reduce wiring complexity. For example, with a bucket buffer of M locations of N-bit words, the bits of the bucket buffer are arranged such that the bit cells of the least significant bits from each of the M memory locations are located at one end of the bucket buffer, while the bit cells of the most significant bits are located at the other end. Such an arrangement facilitates efficient routing of the bitlines from the sorter ports 1852.
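The interleaved arrangement amounts to a simple index mapping: bit n of word m occupies physical bit column n·M + m, so the M least significant bits sit together at one end and the M most significant bits at the other, and each bit line of the N-bit bucket bus feeds one contiguous group of M cells. The sketch below models that mapping and the resulting M×N-bit memory-bus value; the function names are hypothetical, and the physical column numbering is an assumption consistent with the description.

```python
from typing import List


def bit_column(m: int, n: int, M: int) -> int:
    """Physical bit column of bit n of word m in an interleaved bucket buffer:
    the M least significant bits come first, the M most significant bits last."""
    return n * M + m


def pack(words: List[int], N: int) -> int:
    """Assemble the M x N-bit memory bus value from M words of N bits each."""
    M = len(words)
    bus = 0
    for m, w in enumerate(words):
        for n in range(N):
            if (w >> n) & 1:
                bus |= 1 << bit_column(m, n, M)
    return bus


def unpack(bus: int, M: int, N: int) -> List[int]:
    """Inverse of pack: recover the M words from the interleaved bus value."""
    return [sum(((bus >> bit_column(m, n, M)) & 1) << n for n in range(N))
            for m in range(M)]
```

For two 2-bit words `0b01` and `0b10`, the least significant bits occupy columns 0 and 1 and the most significant bits columns 2 and 3, giving a bus value of `0b1001`.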

The data selectors 1830 c direct the M×N bits of the memory bus 1840 to and from one of J sets of one or more transfer registers 1830 b. Each set of the transfer registers 1830 b holds data for one or more data transfers to and from the memory array 1810. The memory transfers also pass through the sense amps 1830 a.

With the depicted organization, the selectors 1830 c are preferably configured as N×M single-bit, J-to-1 selectors, where each of the N×M single-bit data selectors transfers (and aligns) one bit from the memory bus 1840 to and from a corresponding bit of one of the J transfer registers 1830 b. The J transfer registers in turn are aligned with, and correspond to, the J sense amp arrays 1830 a and the J column arrays 1820 of the memory array 1810.

For clarity, the column arrays 1820, the sense amps 1830 a, and the transfer registers 1830 b are shown logically in separate columns. In the actual physical layout of these elements, the bit columns are interleaved such that each element spans the width of the memory array 1810.

The depicted organization facilitates alignment of the data bits from the bucket buffers 1850 with those of the memory array 1810, thereby minimizing the on-chip real estate dedicated to wiring paths between the depicted elements.

Referring to FIG. 19, one embodiment of a bucket sorting method 1900 may be conducted independently of, or in conjunction with, the bucket sorter 1800. The bucket sorting method 1900 commences with a start step 1910 followed by an allocate step 1920. The allocate step 1920 allocates storage regions within a memory, such as the memory array 1810, that are assigned to specific “buckets.”

Bucket buffers such as the bucket buffers 1850 may also be assigned to buckets, although in certain embodiments there are fewer bucket buffers than actual buckets. In these embodiments, some bucket buffers may be assigned to a “miscellaneous” or “other” bucket whose contents must be resorted when additional bucket buffers are available. Sorting may also be conducted recursively by dividing available bucket buffers into groups, for example by sorting on a sorting key one bit at a time.

The bucket sorting method 1900 proceeds from the allocate step 1920 to a route step 1930. The route step 1930 writes a data element to the bucket write buffer 1850 a that corresponds to a data key. The data element may be received via one of the bucket write ports 1862 a and may, for example, be received from an external functional unit or from one of the sorter output ports 1852 b, such as when recursively sorting data. The data key may be part of the data element, or the data key may be provided separately. After the route step 1930, the bucket sorting method 1900 proceeds to a buffer full test 1940.

The buffer full test 1940 ascertains whether the buffer that was written to is full. In one embodiment, the buffer full test comprises checking a signal from the particular bucket write buffer 1850 a. If the buffer full test is not true, the bucket sorting method 1900 loops to the route step 1930. Otherwise, the method proceeds to an empty buffer step 1950.

The empty buffer step 1950 transfers the contents of a bucket buffer, such as a bucket buffer 1850, to a region of memory associated with a particular bucket. In certain embodiments, the empty buffer step 1950 is followed by a bucket full test 1960. The bucket full test 1960 ascertains whether the region of memory associated with the particular bucket is full.

If the tested bucket is full, the bucket sorting method 1900 loops to the allocate step 1920, where in one embodiment additional memory is allocated. Otherwise, the bucket sorting method 1900 loops to the route step 1930 to process additional data elements. The buffer full test 1940, the empty buffer step 1950, and the bucket full test 1960 are preferably conducted in parallel for each bucket buffer.
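The steps of the bucket sorting method 1900 can be modeled sequentially in software. In this sketch, which is an illustration and not the bucket sorter 1800 itself, each bucket has a small write buffer that is emptied to memory as one block (standing in for one wide bus transfer), memory is tracked as a per-bucket list of fixed-size regions, and a new region is allocated when the current one fills. The class name, parameters, and tracking structure are hypothetical.

```python
from typing import Dict, List


class BucketSorter:
    """Sequential model of the route, buffer full, empty buffer, bucket full,
    and allocate steps. All names and sizes are illustrative."""

    def __init__(self, num_buckets: int, buffer_size: int, region_blocks: int):
        self.buffer_size = buffer_size          # elements per bucket write buffer
        self.region_blocks = region_blocks      # buffer loads per memory region
        self.buffers: List[List[int]] = [[] for _ in range(num_buckets)]
        # memory[bucket] is a list of regions; each region is a list of blocks
        self.memory: Dict[int, List[List[List[int]]]] = {
            b: [[]] for b in range(num_buckets)}   # allocate step: one region each
        self.wide_transfers = 0

    def route(self, element: int, key: int) -> None:
        buf = self.buffers[key]
        buf.append(element)                     # route step
        if len(buf) == self.buffer_size:        # buffer full test
            self._empty(key)

    def _empty(self, bucket: int) -> None:
        region = self.memory[bucket][-1]
        region.append(self.buffers[bucket])     # empty buffer step: one wide transfer
        self.buffers[bucket] = []
        self.wide_transfers += 1
        if len(region) == self.region_blocks:   # bucket full test
            self.memory[bucket].append([])      # allocate another region

    def flush(self) -> None:
        """Empty any partially filled buffers at the end of the input."""
        for bucket, buf in enumerate(self.buffers):
            if buf:
                self._empty(bucket)

    def contents(self, bucket: int) -> List[int]:
        return [e for region in self.memory[bucket] for block in region for e in block]
```

Routing the values 0 through 19 with key `v % 4` into buffers of two elements yields 12 wide transfers (two full and one partial per bucket), and bucket 1 then holds `[1, 5, 9, 13, 17]` in arrival order.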

Referring to FIG. 20 a, one embodiment of the sorting z-buffer 530 is embodied with the bucket sorter 1800. Specifically, the region sorter 535 comprises the bucket buffers 1850 and the bucket controller 1860, while the region memory 540 comprises the memory array 1810 and the read/write buffers 1830.

Referring to FIG. 20 b, one embodiment of a sorting z-buffer method 2000 of the present invention may be used in conjunction with, or independently of, the sorting z-buffer 530. The sorting z-buffer method 2000 commences with a start step 2010, followed by a sort step 2020. The sort step 2020 sorts pixels, such as the potentially visible pixels 512, into regions. In one embodiment, each region is a rectangular region of the graphical scene 150 that is a small portion of the tile 310, and the sort step 2020 is conducted by the bucket sorter 1800.

The sort step 2020 is followed by a z-buffer step 2030. The z-buffer step 2030 maintains the shallowest pixel for each x,y position within a region. The z-buffer step 2030 processes the pixels for an entire region, resulting in visible pixels for the processed region, such as the visible pixels 532.

The sorting z-buffer method 2000 proceeds from the z-buffer step 2030 to a regions processed test 2040. The regions processed test 2040 ascertains whether all the sorted regions have been processed by the z-buffer step 2030. If not, the sorting z-buffer method 2000 loops to the z-buffer step 2030. Otherwise, the sorting z-buffer method 2000 terminates 2050.
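The benefit of the sorting z-buffer method 2000 is that the z-buffer only needs to cover one region at a time. The sketch below illustrates this with square regions: pixels are first sorted into per-region lists, then each list is resolved with a region-sized depth array. The pixel tuple format, the square region shape, and non-negative coordinates are assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Pixel = Tuple[int, int, float, str]   # hypothetical format: (x, y, depth, color)


def sorting_zbuffer(pixels: List[Pixel], region: int) -> Dict[Tuple[int, int], Pixel]:
    """Sort pixels into square regions, then resolve visibility one region at
    a time using a z-buffer only region x region entries in size."""
    regions: Dict[Tuple[int, int], List[Pixel]] = defaultdict(list)
    for p in pixels:                                   # sort step
        regions[(p[0] // region, p[1] // region)].append(p)

    visible: Dict[Tuple[int, int], Pixel] = {}
    for (rx, ry), plist in regions.items():            # until all regions are processed
        depth = [[math.inf] * region for _ in range(region)]
        winner: List[List[Optional[Pixel]]] = [[None] * region for _ in range(region)]
        for p in plist:                                # z-buffer step
            lx, ly = p[0] - rx * region, p[1] - ry * region
            if p[2] < depth[ly][lx]:                   # keep the shallowest pixel
                depth[ly][lx] = p[2]
                winner[ly][lx] = p
        for row in winner:                             # emit the region's visible pixels
            for p in row:
                if p is not None:
                    visible[(p[0], p[1])] = p
    return visible
```

Only one region-sized depth array is live at a time, which is what permits a small z-buffer even when the tile being rendered is large.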

Referring to FIG. 21 a, one embodiment of a graphics memory localizer 2100 increases the locality of memory accesses and includes a request sorter 2110, a set of page access queues 2120, and a graphics memory 2130. The request sorter 2110 may be embodied as the sorter 1800 b, while the page access queues may be embodied as the memory 1800 a. The graphics memory 2130 may be embodied as random access memory comprised of internal and external DRAM.

The request sorter 2110 receives an access request 2108, which in one embodiment comprises an address field, a data field, and an operation field. Multiple access requests 2108 are received and sorted into the page access queues 2120 via an access bus 2122. The request sorter 2110 also retrieves sorted requests from the page access queues and directs the sorted requests to the graphics memory 2130 via the memory bus 1840. Sorting the memory access requests into page queues increases the number of page hits within the graphics memory 2130, thereby increasing the rendering performance within a graphical system. The graphics memory 2130 provides data to a data bus 2132.

Referring to FIG. 21 b, one embodiment of a graphics memory localization method 2150 may be conducted independently of, or in conjunction with, the graphics memory localizer 2100. The graphics memory localization method 2150 commences with a start step 2155 followed by a sort step 2160. The sort step 2160 sorts a preferably large number of access requests into a set of page queues. The sort step 2160 is followed by a process queue step 2170.

The process queue step 2170 processes the requests from one page queue. When conducted in conjunction with cached or paged memory, processing the requests from a single page queue results in sustained cache or page hits. By sorting access requests, the graphics memory localization method 2150 significantly increases the level of performance attainable with memory subsystems such as, for example, a subsystem using page mode DRAM or the like, wherein localized (i.e., page mode) memory accesses are much faster than non-localized (i.e., normal) memory accesses.

The graphics memory localization method 2150 proceeds from the process queue step 2170 to a queues processed test 2180. The queues processed test 2180 ascertains whether all the page queues have been processed. If not, the graphics memory localization method 2150 loops to the process queue step 2170; otherwise, the method terminates 2190.
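The effect of the graphics memory localization method 2150 can be estimated with a simple open-page model in which any request to a page other than the one accessed immediately before counts as a page miss. The request format, page size, and miss model below are illustrative assumptions rather than a model of any particular DRAM. Because each queue keeps its requests in arrival order and every request to a given address lands in the same queue, requests to the same address are never reordered relative to each other.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, int]   # hypothetical format: (operation, address)


def sort_into_page_queues(requests: List[Request], page_size: int) -> Dict[int, List[Request]]:
    """Sort step: append each request to the queue of the page it touches."""
    queues: Dict[int, List[Request]] = {}
    for req in requests:
        queues.setdefault(req[1] // page_size, []).append(req)
    return queues


def localized_order(requests: List[Request], page_size: int) -> List[Request]:
    """Process queue step, repeated until every page queue has been processed."""
    queues = sort_into_page_queues(requests, page_size)
    return [req for queue in queues.values() for req in queue]


def count_page_misses(order: List[Request], page_size: int) -> int:
    """Simple open-page model: a miss occurs whenever a request targets a page
    other than the one accessed immediately before it."""
    misses, open_page = 0, None
    for _, addr in order:
        page = addr // page_size
        if page != open_page:
            misses += 1
            open_page = page
    return misses
```

For requests that alternate between two 4096-byte pages (addresses 0, 4096, 8, 4100, 16, 4104), every access misses in arrival order, while the localized order incurs only two misses.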

FIG. 22 relates certain elements of the graphics engine to the bucket sorter 1800. A pixel colorizer 2200 includes a set of address calculators 555 a, a set of attribute processors 555 b, the attribute request sorter 560, the attribute request queues 565, and the pixel attribute memory 580. The address calculators 555 a and the attribute processors 555 b may comprise the pixel colorizers 555 shown in FIG. 5, while the pixel colorizer 2200 may be contained within the graphics engine 480.

In the depicted embodiment, the pixel colorizer 2200 includes a pixel combiner 2210. The pixel combiner 2210 is preferred in embodiments that conduct super-sampled rendering. Super-sampled rendering increases visual quality by rendering a set of pixels for each output pixel. The set of rendered pixels is filtered (i.e., smoothed) to provide each output pixel.

The pixel combiner 2210 examines the visible pixels 532 that comprise a single output pixel. The pixel descriptors of the pixels associated with an output pixel are accessed to ascertain whether some or all of the pixels may be combined into a representative pixel 2212. If not, the visible pixels 532 are passed along without being combined.

In one embodiment, combining is performed if multiple pixels originate from the same patch and texture. In such cases, it may not be advantageous to conduct texture lookups and shading for all of those subpixels; the associated visible pixels 532 are therefore discarded from further rendering, with the exception of the representative pixel 2212. The representative pixel 2212 is preferably the center pixel of the set of pixels it represents.
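The combining decision can be sketched as follows for a 3x3 grid of subpixels per output pixel. The `Subpixel` record, its fields, the grid size, and the function name `combine` are hypothetical, and the sketch handles only the all-or-nothing case in which every visible subpixel shares one patch and texture; combining a subset of the pixels, which the description also permits, is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subpixel:
    sx: int       # hypothetical: subpixel column within the output pixel
    sy: int       # hypothetical: subpixel row within the output pixel
    patch: int    # id of the patch the subpixel originated from
    texture: int  # id of the texture applied to that patch


def combine(visible, grid=3):
    """Return the subpixels that still need texturing and shading for one
    output pixel. If all visible subpixels share a patch and texture, keep
    only the one nearest the grid center as the representative pixel;
    otherwise pass every subpixel along uncombined."""
    keys = {(p.patch, p.texture) for p in visible}
    if len(keys) != 1:
        return list(visible)
    c = (grid - 1) / 2  # center coordinate of the subpixel grid
    representative = min(visible, key=lambda p: (p.sx - c) ** 2 + (p.sy - c) ** 2)
    return [representative]


same = [Subpixel(x, y, 7, 1) for y in range(3) for x in range(3)]
print(combine(same))  # only the center subpixel (1, 1) remains
```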

In the depicted embodiment, the address calculators 555 a compute a memory address associated with an attribute of interest. The memory address is presented as the attribute request 557. The attribute request 557 is handled by the attribute request sorter 560 in the manner related in the description of FIG. 5, which provides the sorted attribute requests 562.

The attribute processors 555 b receive the visible pixels 532 or the representative pixels 2212 along with the pixel attributes 582 and provide the colorized pixels 552. The colorized pixels 552 may be recirculated within the pixel colorizer 2200 via a recirculation bus 2220. Recirculation facilitates the acquisition of additional attributes for each pixel.

Referring to FIG. 23, one embodiment of a pixel colorization method 2300 of the present invention may be conducted independently of, or in conjunction with, the pixel colorizer 2200 or the graphics engine 480. The pixel colorization method 2300 begins with a start step 2310 followed by a calculate address step 2320, a sort requests step 2330, and a process queue step 2340.

The calculate address step 2320 computes a memory address for a needed attribute such as a color table entry, a texture map, shading data, and the like. The needed attributes may depend on the type of object from which the pixels originated. The calculate address step 2320 is preferably conducted for a large number of pixels, such as the visible pixels 532. The pixel colorization method 2300 contributes to the localization of memory references by processing the same needed attribute for every pixel in the set of pixels of interest. Typically, accessing the same attribute focuses the memory references on a relatively small portion of a graphics memory such as the pixel attribute memory 580.
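A hypothetical address calculator for a texture-map attribute might look like the sketch below. The row-major texture layout, the 4-byte texel size, nearest-texel lookup, the `(base, width, height)` attribute tuple, and the names `texel_address` and `calculate_addresses` are assumptions made for illustration. Because every pixel requests the same attribute, all of the computed addresses fall within the one texture's address range.

```python
TEXEL_BYTES = 4  # hypothetical: 32-bit RGBA texels


def texel_address(base, width, height, u, v):
    """Map normalized texture coordinates (u, v) in [0, 1] to the byte
    address of the nearest texel in a row-major texture at `base`."""
    x = min(int(u * width), width - 1)   # clamp u == 1.0 to the last column
    y = min(int(v * height), height - 1)
    return base + (y * width + x) * TEXEL_BYTES


def calculate_addresses(pixels, attribute):
    """Calculate address step: compute the address of the same needed
    attribute for every pixel of interest. `pixels` holds (u, v) pairs;
    returns (pixel_index, address) requests for the request sorter."""
    base, width, height = attribute
    return [(i, texel_address(base, width, height, u, v))
            for i, (u, v) in enumerate(pixels)]


texture = (0x10000, 256, 256)  # base address, width, height (hypothetical)
reqs = calculate_addresses([(0.0, 0.0), (0.5, 0.5), (0.999, 0.999)], texture)
print(reqs)  # every address lies inside the 256 KiB texture starting at 0x10000
```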

The sort requests step 2330 sorts the preferably large number of calculated addresses into page queues to further increase the locality of memory references. The process queue step 2340 accesses a memory such as the pixel attribute memory 580 with the sorted addresses. In one embodiment, the process queue step 2340 uses the retrieved attribute information to colorize the visible pixels 532.

The pixel colorization method 2300 proceeds from the process queue step 2340 to a queues processed test 2350. The queues processed test 2350 ascertains whether every page queue with a pending request has been processed. If not, the pixel colorization method 2300 loops to the process queue step 2340. Otherwise, the method proceeds to an attributes processed test 2360.

The attributes processed test 2360 ascertains whether all relevant attributes have been processed for the pixels of interest, such as a frame of visible pixels 532. If not, the pixel colorization method 2300 loops to the calculate address step 2320. Otherwise, the pixel colorization method 2300 terminates at an end step 2370.
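The overall flow of the pixel colorization method 2300 can be sketched as below: for each needed attribute, addresses are calculated for every pixel, sorted into page queues, and the queues are drained in turn before moving to the next attribute. The page size `PAGE_BYTES`, the representation of the pixel attribute memory as a Python mapping, the `(address_fn, apply_fn)` pairs describing each attribute, and the page-switch counter are all hypothetical.

```python
from collections import defaultdict

PAGE_BYTES = 2048  # hypothetical memory page size


def colorize(pixels, attributes, memory):
    """Sketch of the pixel colorization loop under stated assumptions.

    pixels:     list of mutable per-pixel records (dicts), updated in place
    attributes: list of (address_fn, apply_fn) pairs, one per needed attribute;
                address_fn(pixel) -> address, apply_fn(pixel, value) colorizes
    memory:     mapping from address to attribute value, standing in for the
                pixel attribute memory
    Returns the number of page switches made while processing the queues."""
    page_switches = 0
    open_page = None
    for address_fn, apply_fn in attributes:  # repeat until attributes processed
        # Calculate address step: the same attribute for every pixel.
        requests = [(address_fn(p), i) for i, p in enumerate(pixels)]
        # Sort requests step: bucket the requests by memory page.
        queues = defaultdict(list)
        for addr, i in requests:
            queues[addr // PAGE_BYTES].append((addr, i))
        # Process queue step, repeated until every page queue is drained.
        for page, queue in queues.items():
            if page != open_page:
                page_switches += 1
                open_page = page
            for addr, i in queue:
                apply_fn(pixels[i], memory[addr])
    return page_switches


if __name__ == "__main__":
    memory = {0: (255, 0, 0), 4: (0, 255, 0), 4096: (0, 0, 255)}  # color table
    pixels = [{"idx": 0}, {"idx": 4096}, {"idx": 4}, {"idx": 0}]

    def set_color(pixel, value):
        pixel["color"] = value

    print(colorize(pixels, [(lambda p: p["idx"], set_color)], memory))  # 2
    print([p["color"] for p in pixels])
```

Unsorted, the same four requests would visit pages 0, 2, 0, 0 and switch pages three times; the sorted order switches only twice.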

The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

1. An apparatus for increasing the access locality of graphical rendering data, and discarding non-visible pixel descriptors in a graphical system, the apparatus comprising: a region memory configured to store pixel descriptors; a region sorter configured to receive pixel descriptors including scene position and depth, and to direct the pixel descriptors to locations within the region memory corresponding to regions within a graphical scene; and a region-sized z-buffer configured to receive the pixel descriptors and to retain the pixel descriptor with the shallowest pixel depth for each position within a graphical scene region.
2. The apparatus of claim 1, wherein the z-buffer is further configured to retain more than one pixel descriptor for each screen position.
3. The apparatus of claim 1, wherein the region sorter comprises: at least one sorter input port configured to receive rendering data; a plurality of bucket buffers configured to transfer the rendering data to the region memory; and a bucket controller configured to select a portion of the rendering data as a bucket key and to direct the rendering data from each sorter input port to the bucket buffer corresponding to the bucket key.
4. A method for increasing the access locality of graphical rendering data and discarding non-visible pixels in a graphical system, the method comprising: sorting pixel descriptors into a plurality of regions based on scene position; and processing the pixel descriptors within a region using a region-sized z-buffer.
5. The method of claim 4, wherein sorting the pixel descriptors comprises at least one level of bucket sorting.
6. An apparatus for increasing the locality of references to locations within a graphics memory, the apparatus comprising: a graphics memory configured to store data within a plurality of memory pages; an access request memory partitioned into a plurality of page access queues; and a request sorter configured to receive requests to access the graphics memory, determine an associated memory page, and direct each request to the page access queue corresponding to the associated memory page.
7. The apparatus of claim 6, wherein the request sorter comprises: at least one sorter input port configured to receive rendering data; a plurality of bucket buffers configured to transfer the rendering data to the graphics memory; and a bucket controller configured to select a portion of the rendering data as a bucket key and to direct the rendering data from each sorter input port to the bucket buffer corresponding to the bucket key.
8. A method for increasing access locality within a graphics memory, the method comprising: sorting a plurality of access requests into a plurality of page queues; and processing the access requests within a page queue.
9. The method of claim 8, wherein sorting a plurality of access requests comprises at least one level of bucket sorting.
10. An apparatus for increasing the access locality of pixel attribute data within a graphical system, the apparatus comprising: a pixel attribute memory configured to store pixel attribute data within a plurality of memory pages; an attribute request memory partitioned into a plurality of attribute request queues; and a request sorter configured to receive requests to access a storage location within the pixel attribute memory and direct the requests to the attribute request queue corresponding to the storage location.
11. The apparatus of claim 10, wherein the request sorter comprises: a plurality of bucket buffers configured to transfer data to the pixel attribute memory; at least one sorter input port configured to receive rendering data; and a bucket controller configured to select a portion of the rendering data as a bucket key and to direct the rendering data from each sorter input port to the bucket buffer corresponding to the bucket key.
12. The apparatus of claim 10, further comprising a colorizer configured to receive pixel attribute data and modify the color of pixels in accordance with the pixel attribute data.
13. A method for increasing the access locality of pixel attribute data to efficiently colorize pixels within a graphical system, the method comprising: sorting a plurality of pixel attribute requests into a plurality of page queues; processing the pixel attribute requests within a page queue to provide pixel attributes; and colorizing pixels based on the pixel attributes.
14. The method of claim 13, wherein sorting comprises at least one level of bucket sorting.
15. The method of claim 13, wherein colorizing is selected from shading, texturing, color table indexing, and shadowing.